This is the second part of the second tutorial, and it aims to equip you with additional tools and techniques to advance your analysis for the Stones 2 Milestones (S2M) path analysis project. While the previous tutorials focused on understanding basic facts about the users and their behavior patterns on the app, this tutorial focuses on moving from purely descriptive analysis to building predictive machine learning models, and on extending the previous tutorial’s cohort analysis with machine learning techniques. In addition, this tutorial shows you how to formulate business-relevant questions that can be answered with the appropriate machine learning approach(es). We focus on two broad questions in this tutorial:
Each section of this tutorial approaches one of these questions. We first discuss the relevance of the question, then how we approach answering this question. As you read this tutorial, consider how else you might approach answering these questions as well as what other business-relevant questions you might answer with these approaches.
First, we repeat some of the preliminaries from the previous tutorials: we load the necessary packages and then the processed datasets. The processed datasets include the filtered user-level data and the filtered user-story interactions data. Please refer to the Preliminaries section in the first part of the second tutorial, where we save these processed datasets (using the write.csv function), to understand which filters are applied to the dataframes before they are saved as CSV files.
# Ensure that pacman is installed for package management and loading.
if (!require("pacman")) install.packages("pacman")
# for data reading, wrangling, and visualization
pacman::p_load(tidyverse)
# for enabling dataframe manipulation (0.8.0.1)
pacman::p_load(dplyr)
# for a simple interface for OLS estimation with robust standard errors
pacman::p_load(estimatr)
# for summary statistics (3042.89)
pacman::p_load(fBasics)
# for data visualization
pacman::p_load(ggplot2)
# for easily highlighting lines and points in a ggplot
pacman::p_load(gghighlight)
# for working with "grid" graphics
pacman::p_load(gridExtra)
# for providing a prettier RMarkdown (1.0.1)
pacman::p_load(kableExtra)
# for providing a general-purpose tool for dynamic report generation (1.21)
pacman::p_load(knitr)
# for dealing with date-type data
pacman::p_load(lubridate)
# for computing the mode of a vector
pacman::p_load(modeest)
# for reading csv files (1.3.1)
pacman::p_load(readr)
# for modern alternative to data frames (2.1.1)
pacman::p_load(tibble)
# for simplifying the process of creating tidy data
pacman::p_load(tidyr)
# for internal scaling of graph labels and text
pacman::p_load(scales)
# for cross-validating glm models
pacman::p_load(boot)
# for streamlining the model training process
pacman::p_load(caret)
# for enabling fast implementation of random forests
pacman::p_load(randomForest)
# for clustering algorithms
pacman::p_load(cluster)
# for clustering algorithms & visualization
pacman::p_load(factoextra)
# for storing data as tables, data frames, or list objects
pacman::p_load(data.table)
# for cross-validating regressions
pacman::p_load(cvms)
# for progress bars
pacman::p_load(plyr)
# for building regression models
pacman::p_load(glmnet)
# for variance estimations
pacman::p_load(plm)
# for testing linear models
pacman::p_load(lmtest)
# for summarizing data
pacman::p_load(ggiraphExtra)
# for advanced predictive modeling
pacman::p_load(e1071)

# Load the filtered user-level data stored as a CSV file
filtered_child_df <- read.csv(file='filtered_child_df.csv')
# Load the filtered dataframe from the cohort analysis tutorial
filtered_logged_df <- read.csv(file='filtered_logged_df.csv')

We will use a machine learning clustering technique known as k-means clustering to divide users into groups based on their behavioral characteristics (a task also known as user segmentation). This technique will help us identify patterns in app usage behavior across different user groups.
Clustering is a set of techniques used for finding subgroups of observations within a data set. It falls into the unsupervised machine learning category because it seeks to find relationships between observations without using a dependent variable y to train on. K-means clustering is the simplest and most commonly used clustering technique that splits a dataset into k clusters based on a specified set of observation characteristics.
In order to perform clustering analysis, we transform the raw data into features that better represent the underlying problem to our model. This transformation is known as feature engineering and is a very important step of building robust machine learning models. When done properly, it can significantly improve model performance.
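As a toy illustration of the feature-engineering step, the sketch below aggregates a hypothetical event log (with the same column names as our interaction data) into one row of behavioral features per user; our real features are built the same way from filtered_logged_df.

```r
# Minimal sketch of feature engineering: aggregate raw event rows into one row
# of behavioral features per user. The toy_log dataframe is made up for
# illustration.
library(dplyr)

toy_log <- data.frame(
  child_id = c(1, 1, 1, 2, 2),
  story_id = c(10, 11, 10, 12, 13),
  days_since_signup = c(0, 0, 3, 1, 2)
)

toy_features <- toy_log %>%
  group_by(child_id) %>%
  summarize(
    n_stories      = n(),                          # total story events
    unique_stories = n_distinct(story_id),         # distinct stories viewed
    n_sessions     = n_distinct(days_since_signup) # distinct active days
  )

print(toy_features)
```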
We start by building features based on users’ initial behavioral characteristics. These user characteristics are similar to the ones you have already analyzed in the previous tutorials.
Before we continue with the clustering analysis, we have to ensure that the data is preprocessed properly: each row should correspond to a single observation (a user) and each column to a feature, all missing values should be removed, and the features should be standardized (centered and scaled) so that no single feature dominates the distance computations.
# Print a random row to ensure it is an observation and that each corresp. column of that row is a feature
print(clustering_features_df[15, ])
## # A tibble: 1 x 18
## # Groups: child_id [1]
## child_id n_stories_firstw… avg_n_stories_fi… n_sessions_firs… unique_stories_…
## <int> <int> <dbl> <int> <int>
## 1 280 19 4.75 4 16
## # … with 13 more variables: unique_sources_firstwk <int>,
## # n_stories_first2wkly_sessions <int>,
## # avg_n_stories_first2wkly_sessions <dbl>, avg_n_stories_first2wks <dbl>,
## # n_sessions_first2wks <int>, unique_stories_first2wks <int>,
## # unique_sources_first2wks <int>, n_stories_first3wkly_sessions <int>,
## # avg_n_stories_first3wkly_sessions <dbl>, avg_n_stories_first3wks <dbl>,
## # n_sessions_first3wks <int>, unique_stories_first3wks <int>,
## # unique_sources_first3wks <int>
# Ensure we removed all missing values from the data
print(sum(is.na(clustering_features_df)))
## [1] 0
# Standardize the features used for clustering -- ensure child_id is not preprocessed
clustering_df <- clustering_features_df %>% ungroup()
vars<-clustering_df %>% select(-c(child_id))
vars<-c(colnames(vars))
pre_proc_val <- preProcess(clustering_df[,vars], method = c("center", "scale"))
clustering_df[,vars] = predict(pre_proc_val, clustering_df[,vars])
rownames(clustering_df) <- clustering_df$child_id
clustering_df$child_id <- NULL

To split users into subgroups, we use k-means clustering, the most commonly used technique for partitioning a dataset into a set of k clusters, where k (the number of clusters) is pre-determined by the data scientist.
The way k-means clusters observations into groups is by maximizing the similarity between observations within the same cluster (i.e. high intra-cluster similarity) while minimizing the similarity between observations from different clusters (i.e. low inter-cluster similarity).
In addition, the reason we call this technique “k-means” clustering is because it represents each cluster by a center (i.e. centroid) that is computed as the mean of the points assigned to that cluster. To assign an observation \(x_i\) to a cluster, the algorithm computes the Euclidean distance between the observation’s features and the cluster mean in order to determine which cluster’s centroid is closest to \(x_i\) and assign \(x_i\) to the closest cluster accordingly.
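To make the assignment step concrete, here is a small sketch (with made-up centroids and a made-up observation) of how a point is assigned to the cluster with the nearest centroid:

```r
# Sketch of the k-means assignment step: compute the Euclidean distance from
# one observation to each centroid and pick the closest. The centroids and the
# observation are made up for illustration.
centroids <- rbind(c(0, 0), c(5, 5), c(10, 0))  # 3 clusters, 2 features
x_i <- c(4, 4)                                   # one observation

dists <- apply(centroids, 1, function(mu) sqrt(sum((x_i - mu)^2)))
assigned_cluster <- which.min(dists)
print(assigned_cluster)  # 2 -- centroid (5, 5) is closest
```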
We begin by arbitrarily setting k to 4 (i.e. centers = 4) and partitioning the users into four clusters based on their initial behavioral characteristics on the app. We also set nstart equal to 25 so that the kmeans function randomly generates 25 initial configurations and keeps the best one: k-means repeats the process of randomly picking 4 observations as the initial centroids 25 times and retains the solution with the lowest total within-cluster variation. Setting nstart = 25 makes the initialization more robust and the resulting model more stable. For example, if the algorithm happens to pick outliers as the initial centroids, the clustering will not be stable, and observations may be assigned to different clusters each time the clusters are reinitialized.
set.seed(1234)
# Perform k-means clustering with k=4
k4 <- kmeans(clustering_df, centers = 4, nstart = 25)
str(k4)
## List of 9
## $ cluster : int [1:17763] 3 3 1 3 1 3 3 1 1 3 ...
## $ centers : num [1:4, 1:17] 0.172 2.084 -0.415 10.027 0.21 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "1" "2" "3" "4"
## .. ..$ : chr [1:17] "n_stories_firstwkly_session" "avg_n_stories_firstwk" "n_sessions_firstwk" "unique_stories_firstwk" ...
## $ totss : num 301954
## $ withinss : num [1:4] 43879 31951 24530 18666
## $ tot.withinss: num 119024
## $ betweenss : num 182930
## $ size : int [1:4] 6616 1212 9891 44
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# Visualize clustering results using fviz_cluster (which performs PCA to plot data points in higher dimensions)
fviz_cluster(k4, clustering_df)

The plot above visualizes the cluster assignment of all data points after principal component analysis (PCA). PCA reduces the dimensionality of the data, which is necessary here because we have more than two features (i.e. dimensions). The axes of the plot correspond to the first two principal components, which explain the majority of the variance in our data.
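If you want to check how much variance those first two components actually explain, a quick sketch with prcomp on the same standardized clustering_df is:

```r
# Sketch: run PCA directly on the standardized features and report the share
# of variance explained by the first two principal components (the plot axes).
pca <- prcomp(clustering_df)  # features are already centered and scaled
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 3))       # share for Dim1 and Dim2
print(round(sum(var_explained[1:2]), 3))  # combined share
```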
We then run k-means with k=5 and k=3, in addition to k=4, to compare the results of this clustering algorithm with varying k values.
# Perform k-means clustering with k=5
k5 <- kmeans(clustering_df, centers = 5, nstart = 25)
str(k5)
## List of 9
## $ cluster : int [1:17763] 5 5 4 5 4 5 4 4 4 4 ...
## $ centers : num [1:5, 1:17] 3.1291 11.6804 0.8224 -0.0859 -0.4447 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : chr [1:17] "n_stories_firstwkly_session" "avg_n_stories_firstwk" "n_sessions_firstwk" "unique_stories_firstwk" ...
## $ totss : num 301954
## $ withinss : num [1:5] 20120 13997 26215 28525 14792
## $ tot.withinss: num 103648
## $ betweenss : num 198306
## $ size : int [1:5] 492 29 2764 6375 8103
## $ iter : int 4
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# Visualize clustering results using fviz_cluster (which performs PCA to plot data points in higher dimensions)
fviz_cluster(k5, clustering_df)

# Perform k-means clustering with k=3
k3 <- kmeans(clustering_df, centers = 3, nstart = 25)
str(k3)
## List of 9
## $ cluster : int [1:17763] 3 3 3 3 2 3 3 3 3 3 ...
## $ centers : num [1:3, 1:17] 8.189 0.75 -0.353 5.86 0.712 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:17] "n_stories_firstwkly_session" "avg_n_stories_firstwk" "n_sessions_firstwk" "unique_stories_firstwk" ...
## $ totss : num 301954
## $ withinss : num [1:3] 27337 76777 46460
## $ tot.withinss: num 150575
## $ betweenss : num 151379
## $ size : int [1:3] 78 5081 12604
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# Visualize clustering results using fviz_cluster (which performs PCA to plot data points in higher dimensions)
fviz_cluster(k3, clustering_df)

As can be observed from the results and plots of the k-means clustering with k=3, k=4, and k=5, the cluster with the highest values on both principal components (denoted Dim1 and Dim2 in the plots) contains far fewer users than the other clusters. Similarly, the cluster with the second-highest values on both principal components is much smaller than the clusters with lower values on both PCs. Intuitively, this makes sense because, overall, our dataset contains significantly fewer users in the highest quartile of user engagement.
Another interesting observation is that as we increase the number of clusters (i.e. k=5), more clusters form around the lower values of principal components rather than the higher values. This increased refinement of user subgroups at the lower values could lead to some interesting cohort analysis for those clusters.
Given that the k-means clustering algorithm requires us to specify the number of clusters, we can use various methods to determine the optimal number. The three most popular methods are the elbow method, the silhouette method, and the gap statistic. We will implement the average silhouette method (described below), but feel free to implement whichever method you like. For more details on the elbow method, see here, and on the gap statistic, see here.
It is also important to note that in practice this data-driven approach to choosing the number of clusters is not usually followed strictly, but it provides a useful starting point.
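For example, the elbow method mentioned above can be sketched as follows: compute the total within-cluster sum of squares (WSS) for a range of k and look for the "elbow" where the curve flattens. This mirrors the silhouette loop below and, like it, takes a long time to run on the full data.

```r
# Sketch of the elbow method: total within-cluster sum of squares (WSS) for
# k = 2..10, plotted so the "elbow" (point of diminishing returns) is visible.
wss_score <- function(k) {
  kmeans(clustering_df, centers = k, nstart = 25)$tot.withinss
}

k.values <- 2:10
wss_values <- sapply(k.values, wss_score)

plot(k.values, wss_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters k",
ylab = "Total within-cluster sum of squares")
```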
The silhouette method measures the quality of a clustering by determining how well each observation lies within its assigned cluster. First, it computes the silhouette coefficient of each data point, which measures how much a point is similar to its own cluster (cohesion) compared to other clusters (separation). Then, it averages out the silhouette coefficient for all data points to obtain the silhouette score.
More specifically, the silhouette coefficient is calculated using the mean intra-cluster distance (\(d_{intra}\)) and the mean nearest-cluster distance (\(d_{nearest\_cluster}\)) for each sample. The silhouette coefficient for a sample is \((d_{nearest\_cluster}-d_{intra}) / max(d_{intra}, d_{nearest\_cluster})\) where \(d_{nearest\_cluster}\) is the distance between a data point and the nearest cluster that the point is not a part of.
The silhouette score ranges from -1 to 1. A high value is desirable because it implies that data points are very similar to the other points in their own cluster and dissimilar to points in other clusters. On the other hand, a low silhouette score implies that the clustering configuration should be changed (i.e. data points should be reassigned to new clusters), as there could be too few or too many clusters in the current configuration.
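Applying the definition above directly, the silhouette coefficient of a single point can be computed by hand and checked against cluster::silhouette. The toy data and cluster labels below are made up for illustration.

```r
# Sketch: compute the silhouette coefficient of one point by hand on toy data
# and compare it with cluster::silhouette. Points and labels are made up.
library(cluster)

pts <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
labels <- c(1, 1, 2, 2)

# Point 1: mean intra-cluster distance vs. mean distance to the nearest other cluster
d_intra   <- mean(dist(rbind(pts[1, ], pts[2, ])))     # distance within own cluster
d_nearest <- mean(c(dist(rbind(pts[1, ], pts[3, ])),
                    dist(rbind(pts[1, ], pts[4, ])))) # distances to cluster 2
s1 <- (d_nearest - d_intra) / max(d_intra, d_nearest)

ss <- silhouette(labels, dist(pts))
print(all.equal(s1, unname(ss[1, "sil_width"])))  # TRUE
```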
We now plot the silhouette score to better understand how the observations are grouped into k clusters.
# Function to compute the silhouette score for k clusters -- takes too long to run,
# so it will not be automatically evaluated
sil_score <- function(k) {
km.res <- kmeans(clustering_df, centers = k, nstart = 25)
ss <- silhouette(km.res$cluster, dist(clustering_df))
mean(ss[, 3])
}
# Compute and plot wss for k = 2 to k = 10
k.values <- 2:10
# extract silhouette scores for 2-10 clusters
sil_values <- map_dbl(k.values, sil_score)
plot(k.values, sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters k",
ylab = "Silhouette Score")

As can be seen from the silhouette plot above, k = 2 has the highest silhouette score and k = 3 the second highest. We can also see that, in general, the silhouette score decreases as the number of clusters increases.
fviz_nbclust(clustering_df, kmeans, method = "silhouette")

Change the input features (i.e. input variables) used to build the k-means clustering model and investigate the new clusters. How do these clusters differ from the ones we have obtained from the k-means model above?
In addition, vary the number of clusters generated by changing the k parameter in the k-means model. Observing the user clusters obtained with different k values, are there any interesting patterns or results?
We also try clustering with the same variables created for weeks 4, 5, and 6.
# total number of stories in first 4 weekly sessions
tot_num_stories_first4wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 28) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(n_stories_first4wkly_sessions = n())
clustering_features_df <- left_join(x = clustering_features_df, y = tot_num_stories_first4wkly_session %>%
select(child_id, n_stories_first4wkly_sessions),
by = 'child_id')
# average number of stories in first 4 weekly sessions
tot_num_stories_first4wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 28) %>%
select(child_id, sessions_since_signup, story_id) %>%
group_by(child_id, sessions_since_signup) %>%
dplyr::summarize(n_stories_first4wkly_sessions = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first4wkly_sessions = mean(n_stories_first4wkly_sessions))
clustering_features_df <- left_join(x = clustering_features_df, y = tot_num_stories_first4wkly_session %>%
select(child_id, avg_n_stories_first4wkly_sessions),
by = 'child_id')
# average number of stories per daily session in first 4 weeks
avg_num_stories_daily_session_first4wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 28) %>%
select(child_id, days_since_signup, story_id)%>%
group_by(child_id, days_since_signup) %>%
dplyr::summarize(n_stories_first4wks = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first4wks = mean(n_stories_first4wks))
clustering_features_df <- left_join(x = clustering_features_df, y = avg_num_stories_daily_session_first4wks %>%
select(child_id, avg_n_stories_first4wks),
by = 'child_id')
# number of unique daily sessions in first 4 weeks
num_daily_sessions_first4wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 28) %>%
select(child_id, days_since_signup)%>%
group_by(child_id) %>%
dplyr::summarize(n_sessions_first4wks = n_distinct(days_since_signup))
clustering_features_df <- left_join(x = clustering_features_df, y = num_daily_sessions_first4wks %>%
select(child_id, n_sessions_first4wks),
by = 'child_id')
# number of unique stories viewed in first 4 weeks
num_unique_stories_first4wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 28) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_stories_first4wks = n_distinct(story_id))
clustering_features_df <- left_join(x = clustering_features_df, y = num_unique_stories_first4wks %>%
select(child_id, unique_stories_first4wks),
by = 'child_id')
# number of unique app features used in first 4 weeks
num_unique_sources_first4wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 28) %>%
select(child_id, source_page_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_sources_first4wks = n_distinct(source_page_id))
clustering_features_df <- left_join(x = clustering_features_df, y = num_unique_sources_first4wks %>%
select(child_id, unique_sources_first4wks),
by = 'child_id')

# total number of stories in first 5 weekly sessions
tot_num_stories_first5wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 35) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(n_stories_first5wkly_sessions = n())
clustering_features_df <- left_join(x = clustering_features_df, y = tot_num_stories_first5wkly_session %>%
select(child_id, n_stories_first5wkly_sessions),
by = 'child_id')
# average number of stories in first 5 weekly sessions
tot_num_stories_first5wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 35) %>%
select(child_id, sessions_since_signup, story_id) %>%
group_by(child_id, sessions_since_signup) %>%
dplyr::summarize(n_stories_first5wkly_sessions = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first5wkly_sessions = mean(n_stories_first5wkly_sessions))
clustering_features_df <- left_join(x = clustering_features_df, y = tot_num_stories_first5wkly_session %>%
select(child_id, avg_n_stories_first5wkly_sessions),
by = 'child_id')
# average number of stories per daily session in first 5 weeks
avg_num_stories_daily_session_first5wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 35) %>%
select(child_id, days_since_signup, story_id)%>%
group_by(child_id, days_since_signup) %>%
dplyr::summarize(n_stories_first5wks = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first5wks = mean(n_stories_first5wks))
clustering_features_df <- left_join(x = clustering_features_df, y = avg_num_stories_daily_session_first5wks %>%
select(child_id, avg_n_stories_first5wks),
by = 'child_id')
# number of unique daily sessions in first 5 weeks
num_daily_sessions_first5wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 35) %>%
select(child_id, days_since_signup)%>%
group_by(child_id) %>%
dplyr::summarize(n_sessions_first5wks = n_distinct(days_since_signup))
clustering_features_df <- left_join(x = clustering_features_df, y = num_daily_sessions_first5wks %>%
select(child_id, n_sessions_first5wks),
by = 'child_id')
# number of unique stories viewed in first 5 weeks
num_unique_stories_first5wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 35) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_stories_first5wks = n_distinct(story_id))
clustering_features_df <- left_join(x = clustering_features_df, y = num_unique_stories_first5wks %>%
select(child_id, unique_stories_first5wks),
by = 'child_id')
# number of unique app features used in first 5 weeks
num_unique_sources_first5wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 35) %>%
select(child_id, source_page_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_sources_first5wks = n_distinct(source_page_id))
clustering_features_df <- left_join(x = clustering_features_df, y = num_unique_sources_first5wks %>%
select(child_id, unique_sources_first5wks),
by = 'child_id')

# total number of stories in first 6 weekly sessions
tot_num_stories_first6wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 42) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(n_stories_first6wkly_sessions = n())
clustering_features_df <- left_join(x = clustering_features_df, y = tot_num_stories_first6wkly_session %>%
select(child_id, n_stories_first6wkly_sessions),
by = 'child_id')
# average number of stories in first 6 weekly sessions
tot_num_stories_first6wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 42) %>%
select(child_id, sessions_since_signup, story_id) %>%
group_by(child_id, sessions_since_signup) %>%
dplyr::summarize(n_stories_first6wkly_sessions = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first6wkly_sessions = mean(n_stories_first6wkly_sessions))
clustering_features_df <- left_join(x = clustering_features_df, y = tot_num_stories_first6wkly_session %>%
select(child_id, avg_n_stories_first6wkly_sessions),
by = 'child_id')
# average number of stories per daily session in first 6 weeks
avg_num_stories_daily_session_first6wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 42) %>%
select(child_id, days_since_signup, story_id)%>%
group_by(child_id, days_since_signup) %>%
dplyr::summarize(n_stories_first6wks = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first6wks = mean(n_stories_first6wks))
clustering_features_df <- left_join(x = clustering_features_df, y = avg_num_stories_daily_session_first6wks %>%
select(child_id, avg_n_stories_first6wks),
by = 'child_id')
# number of unique daily sessions in first 6 weeks
num_daily_sessions_first6wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 42) %>%
select(child_id, days_since_signup)%>%
group_by(child_id) %>%
dplyr::summarize(n_sessions_first6wks = n_distinct(days_since_signup))
clustering_features_df <- left_join(x = clustering_features_df, y = num_daily_sessions_first6wks %>%
select(child_id, n_sessions_first6wks),
by = 'child_id')
# number of unique stories viewed in first 6 weeks
num_unique_stories_first6wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 42) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_stories_first6wks = n_distinct(story_id))
clustering_features_df <- left_join(x = clustering_features_df, y = num_unique_stories_first6wks %>%
select(child_id, unique_stories_first6wks),
by = 'child_id')
# number of unique app features used in first 6 weeks
num_unique_sources_first6wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 42) %>%
select(child_id, source_page_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_sources_first6wks = n_distinct(source_page_id))
clustering_features_df <- left_join(x = clustering_features_df, y = num_unique_sources_first6wks %>%
select(child_id, unique_sources_first6wks),
by = 'child_id')

clustering_features_df_added <- clustering_features_df
clustering_features_df <- clustering_features_df[,-c(18:36)]

# Print a random row to ensure it is an observation and that each corresp. column of that row is a feature
print(clustering_features_df_added[15, ])
## # A tibble: 1 x 36
## # Groups: child_id [1]
## child_id n_stories_firstw… avg_n_stories_fi… n_sessions_firs… unique_stories_…
## <int> <int> <dbl> <int> <int>
## 1 280 19 4.75 4 16
## # … with 31 more variables: unique_sources_firstwk <int>,
## # n_stories_first2wkly_sessions <int>,
## # avg_n_stories_first2wkly_sessions <dbl>, avg_n_stories_first2wks <dbl>,
## # n_sessions_first2wks <int>, unique_stories_first2wks <int>,
## # unique_sources_first2wks <int>, n_stories_first3wkly_sessions <int>,
## # avg_n_stories_first3wkly_sessions <dbl>, avg_n_stories_first3wks <dbl>,
## # n_sessions_first3wks <int>, unique_stories_first3wks <int>,
## # unique_sources_first3wks <int>, n_stories_first4wkly_sessions <int>,
## # avg_n_stories_first4wkly_sessions <dbl>, avg_n_stories_first4wks <dbl>,
## # n_sessions_first4wks <int>, unique_stories_first4wks <int>,
## # unique_sources_first4wks <int>, n_stories_first5wkly_sessions <int>,
## # avg_n_stories_first5wkly_sessions <dbl>, avg_n_stories_first5wks <dbl>,
## # n_sessions_first5wks <int>, unique_stories_first5wks <int>,
## # unique_sources_first5wks <int>, n_stories_first6wkly_sessions <int>,
## # avg_n_stories_first6wkly_sessions <dbl>, avg_n_stories_first6wks <dbl>,
## # n_sessions_first6wks <int>, unique_stories_first6wks <int>,
## # unique_sources_first6wks <int>
# Ensure we removed all missing values from the data
print(sum(is.na(clustering_features_df_added)))
## [1] 0
# Standardize the features used for clustering -- ensure child_id is not preprocessed
clustering_df_added <- clustering_features_df_added %>% ungroup()
vars<-clustering_df_added %>% select(-c(child_id))
vars<-c(colnames(vars))
pre_proc_val <- preProcess(clustering_df_added[,vars], method = c("center", "scale"))
clustering_df_added[,vars] = predict(pre_proc_val, clustering_df_added[,vars])
rownames(clustering_df_added) <- clustering_df_added$child_id
clustering_df_added$child_id <- NULL

# Perform k-means clustering with k=3
k3_new <- kmeans(clustering_df_added, centers = 3, nstart = 25)
str(k3_new)
## List of 9
## $ cluster : int [1:17763] 2 2 2 2 2 2 2 2 1 2 ...
## $ centers : num [1:3, 1:35] 0.715 -0.34 6.332 0.678 -0.311 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:35] "n_stories_firstwkly_session" "avg_n_stories_firstwk" "n_sessions_firstwk" "unique_stories_firstwk" ...
## $ totss : num 621670
## $ withinss : num [1:3] 153041 95293 67268
## $ tot.withinss: num 315603
## $ betweenss : num 306067
## $ size : int [1:3] 5000 12649 114
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# Visualize clustering results using fviz_cluster (which performs PCA to plot data points in higher dimensions)
fviz_cluster(k3_new, clustering_df_added)

# Function to compute the silhouette score for k clusters -- takes too long to run,
# so it will not be automatically evaluated; uncomment to calculate separately
# sil_score_updated <- function(k) {
# km.res <- kmeans(clustering_df_added, centers = k, nstart = 25)
# ss <- silhouette(km.res$cluster, dist(clustering_df_added))
# mean(ss[, 3])
# }
#
# # Compute and plot wss for k = 2 to k = 10
# k.values <- 2:10
#
# # extract silhouette scores for 2-10 clusters
# sil_values_updated <- map_dbl(k.values, sil_score_updated)
#
# plot(k.values, sil_values_updated,
# type = "b", pch = 19, frame = FALSE,
# xlab = "Number of clusters k",
# ylab = "Silhouette Score")

fviz_nbclust(clustering_df_added, kmeans, method = "silhouette")

Since the variables for each subsequent week also capture information from the previous weeks, we also ran a clustering with only the week 6 variables and saw a significant improvement in the silhouette score:
clustering_df_added_wk6<-clustering_df_added[,30:35]
fviz_nbclust(clustering_df_added_wk6, kmeans, method = "silhouette")

# Uncomment to calculate separately using the function below
# sil_score_updated_wk6 <- function(k) {
# km.res <- kmeans(clustering_df_added_wk6, centers = k, nstart = 25)
# ss <- silhouette(km.res$cluster, dist(clustering_df_added_wk6))
# mean(ss[, 3])
# }
#
# sil_values_updated_wk6 <- map_dbl(k.values, sil_score_updated_wk6)

Here we engineer the week 9 variables, which will serve as response variables in the predictive modeling.
# These are additional response variables we created for the predictive modeling
# total number of stories in first 9 weekly sessions
tot_num_stories_first9wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 63) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(n_stories_first9wkly_sessions = n())
clustering_features_df_addedwk9 <- left_join(x = clustering_features_df_added, y = tot_num_stories_first9wkly_session %>%
select(child_id, n_stories_first9wkly_sessions),
by = 'child_id')
# average number of stories in first 9 weekly sessions
tot_num_stories_first9wkly_session = filtered_logged_df %>%
dplyr::filter(days_since_signup < 63) %>%
select(child_id, sessions_since_signup, story_id) %>%
group_by(child_id, sessions_since_signup) %>%
dplyr::summarize(n_stories_first9wkly_sessions = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first9wkly_sessions = mean(n_stories_first9wkly_sessions))
clustering_features_df_addedwk9 <- left_join(x = clustering_features_df_addedwk9, y = tot_num_stories_first9wkly_session %>%
select(child_id, avg_n_stories_first9wkly_sessions),
by = 'child_id')
# average number of stories per daily session in first 9 weeks
avg_num_stories_daily_session_first9wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 63) %>%
select(child_id, days_since_signup, story_id)%>%
group_by(child_id, days_since_signup) %>%
dplyr::summarize(n_stories_first9wks = n()) %>%
group_by(child_id) %>%
dplyr::summarize(avg_n_stories_first9wks = mean(n_stories_first9wks))
clustering_features_df_addedwk9 <- left_join(x = clustering_features_df_addedwk9, y = avg_num_stories_daily_session_first9wks %>%
select(child_id, avg_n_stories_first9wks),
by = 'child_id')
# number of unique daily sessions in first 9 weeks
num_daily_sessions_first9wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 63) %>%
select(child_id, days_since_signup)%>%
group_by(child_id) %>%
dplyr::summarize(n_sessions_first9wks = n_distinct(days_since_signup))
clustering_features_df_addedwk9 <- left_join(x = clustering_features_df_addedwk9, y = num_daily_sessions_first9wks %>%
select(child_id, n_sessions_first9wks),
by = 'child_id')
# number of unique stories viewed in first 9 weeks
num_unique_stories_first9wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 63) %>%
select(child_id, story_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_stories_first9wks = n_distinct(story_id))
clustering_features_df_addedwk9 <- left_join(x = clustering_features_df_addedwk9, y = num_unique_stories_first9wks %>%
select(child_id, unique_stories_first9wks),
by = 'child_id')
# number of unique app features used in first 9 weeks
num_unique_sources_first9wks = filtered_logged_df %>%
dplyr::filter(days_since_signup < 63) %>%
select(child_id, source_page_id)%>%
group_by(child_id) %>%
dplyr::summarize(unique_sources_first9wks = n_distinct(source_page_id))
clustering_features_df_addedwk9 <- left_join(x = clustering_features_df_addedwk9, y = num_unique_sources_first9wks %>%
select(child_id, unique_sources_first9wks),
by = 'child_id')
Here we create the temporal variables for clustering.
user_num_and_max_sessions<-filtered_logged_df %>%
select(days_since_signup, child_id) %>%
group_by(child_id) %>%
dplyr::summarize(max_days_since_signup = max(days_since_signup),
num_days_active=n(),
prop_active=num_days_active/(max_days_since_signup))
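One caveat worth flagging here: prop_active divides by max_days_since_signup. If days_since_signup is 0-indexed (as the `< 63` filter for the first 9 weeks suggests), a user who was only active on signup day has a maximum of 0, and R silently returns Inf rather than raising an error. A minimal base-R sketch of the issue and one possible guard (the values are hypothetical):

```r
# Hypothetical per-user values: the first user was active only on day 0
max_days <- c(0, 10, 20)   # max_days_since_signup
active   <- c(1, 5, 10)    # num_days_active
prop_raw <- active / max_days          # first element is Inf, not an error
# One possible guard: treat day-0-only users as fully active
prop_safe <- ifelse(max_days == 0, 1, active / max_days)
print(prop_raw)
print(prop_safe)
```

Whether to impute 1, NA, or something else is a modeling choice; the point is that the division deserves an explicit decision before these values feed into k-means.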
clustering_features_df_added<-left_join(clustering_features_df_added, user_num_and_max_sessions, by="child_id")
clustering_features_df_addedwk9<-left_join(clustering_features_df_addedwk9, user_num_and_max_sessions, by="child_id")
# Print a random row to ensure it is an observation and that each corresp. column of that row is a feature
print(clustering_features_df_added[15, ])
## # A tibble: 1 x 39
## # Groups: child_id [1]
## child_id n_stories_firstw… avg_n_stories_fi… n_sessions_firs… unique_stories_…
## <int> <int> <dbl> <int> <int>
## 1 280 19 4.75 4 16
## # … with 34 more variables: unique_sources_firstwk <int>,
## # n_stories_first2wkly_sessions <int>,
## # avg_n_stories_first2wkly_sessions <dbl>, avg_n_stories_first2wks <dbl>,
## # n_sessions_first2wks <int>, unique_stories_first2wks <int>,
## # unique_sources_first2wks <int>, n_stories_first3wkly_sessions <int>,
## # avg_n_stories_first3wkly_sessions <dbl>, avg_n_stories_first3wks <dbl>,
## # n_sessions_first3wks <int>, unique_stories_first3wks <int>,
## # unique_sources_first3wks <int>, n_stories_first4wkly_sessions <int>,
## # avg_n_stories_first4wkly_sessions <dbl>, avg_n_stories_first4wks <dbl>,
## # n_sessions_first4wks <int>, unique_stories_first4wks <int>,
## # unique_sources_first4wks <int>, n_stories_first5wkly_sessions <int>,
## # avg_n_stories_first5wkly_sessions <dbl>, avg_n_stories_first5wks <dbl>,
## # n_sessions_first5wks <int>, unique_stories_first5wks <int>,
## # unique_sources_first5wks <int>, n_stories_first6wkly_sessions <int>,
## # avg_n_stories_first6wkly_sessions <dbl>, avg_n_stories_first6wks <dbl>,
## # n_sessions_first6wks <int>, unique_stories_first6wks <int>,
## # unique_sources_first6wks <int>, max_days_since_signup <int>,
## # num_days_active <int>, prop_active <dbl>
# Ensure we removed all missing values from the data
print(sum(is.na(clustering_features_df_added)))
## [1] 0
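The zero count confirms that the joins did not introduce missing values for this sample. In general, though, a left join fills the joined columns with NA for any user missing from the right-hand summary table (e.g. no logged activity in a given window), so a check like the one above is worth keeping. A self-contained base-R sketch with hypothetical data (merge with all.x = TRUE behaves like left_join):

```r
# Two toy tables: every user on the left, activity summaries for a subset
users    <- data.frame(child_id = 1:4)
activity <- data.frame(child_id = c(1, 3), n_stories = c(5, 2))
# Left join: users 2 and 4 get NA in n_stories
joined <- merge(users, activity, by = "child_id", all.x = TRUE)
print(sum(is.na(joined$n_stories)))   # 2 missing rows
# Imputing zero usage is one reasonable choice for count features
joined$n_stories[is.na(joined$n_stories)] <- 0
```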
# Standardize the features used for clustering -- ensure child_id is not preprocessed
clustering_df_added_temporal <- clustering_features_df_added %>% ungroup()
vars<-clustering_df_added_temporal %>% select(-c(child_id))
vars<-c(colnames(vars))
# preProcess() comes from the caret package
pre_proc_val <- preProcess(clustering_df_added_temporal[,vars], method = c("center", "scale"))
clustering_df_added_temporal[,vars] = predict(pre_proc_val, clustering_df_added_temporal[,vars])
rownames(clustering_df_added_temporal) <- clustering_df_added_temporal$child_id
# Used in configuration where we don't standardize temporal variables
clustering_df_added_temporal2<-clustering_df_added_temporal
clustering_df_added_temporal$child_id <- NULL
Here we calculate the correlations between the week 3 variables for variable selection.
# Calculating correlations between week 3 variables
clustering_features_df_3wks<-clustering_features_df_added[,13:18]
correlations_3wks<-cor(clustering_features_df_3wks,method="pearson")
correlations_3wks<-as.matrix(correlations_3wks)
clustering_subset<-clustering_df_added_temporal %>%
select(avg_n_stories_first3wkly_sessions, unique_sources_first3wks, n_sessions_first3wks, max_days_since_signup, prop_active)
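The correlation matrix computed above guides the variable selection for clustering_subset: when two features are almost perfectly correlated, keeping both adds little information to k-means and inflates that dimension's weight. A self-contained base-R sketch of dropping one variable from each highly correlated pair (the data and the 0.9 cutoff are illustrative):

```r
set.seed(1)
x1 <- rnorm(100)
# x2 is essentially a rescaled copy of x1; x3 is independent noise
d  <- data.frame(x1 = x1,
                 x2 = 2 * x1 + rnorm(100, sd = 0.01),
                 x3 = rnorm(100))
cm <- abs(cor(d, method = "pearson"))
diag(cm) <- 0                          # ignore self-correlations
high <- which(cm > 0.9, arr.ind = TRUE)
# For each correlated pair (i, j) with i < j, drop the j-th variable
drop_cols <- unique(colnames(cm)[high[high[, "row"] < high[, "col"], "col"]])
d_reduced <- d[, setdiff(colnames(d), drop_cols), drop = FALSE]
print(colnames(d_reduced))
```

caret::findCorrelation implements a more careful version of this heuristic.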
# function to compute silhouette score for k clusters -- takes too long to run, so it is not automatically evaluated; uncomment to evaluate
# sil_score_subset <- function(k) {
# km.res <- kmeans(clustering_subset, centers = k, nstart = 25)
# ss <- silhouette(km.res$cluster, dist(clustering_subset))
# mean(ss[, 3])
# }
#
# # Compute and plot silhouette scores for k = 2 to k = 10
# k.values <- 2:10
#
# # extract silhouette scores for 2-10 clusters
# sil_values_subset <- map_dbl(k.values, sil_score_subset)
#
# plot(k.values, sil_values_subset,
# type = "b", pch = 19, frame = FALSE,
# xlab = "Number of clusters k",
# ylab = "Silhouette Score")
# fviz_nbclust() is from the factoextra package
fviz_nbclust(clustering_subset, kmeans, method = "silhouette")
# if we don't standardize anything
clustering_subset_2<-clustering_features_df_added %>%
select(avg_n_stories_first3wkly_sessions, unique_sources_first3wks, n_sessions_first3wks, max_days_since_signup, prop_active)
clustering_subset_2<-clustering_subset_2[,-1] # drop child_id, which select() keeps on a grouped tibble
# function to compute silhouette score for k clusters -- takes too long to run, so it is not automatically evaluated; uncomment to evaluate
# sil_score_subset_2 <- function(k) {
# km.res <- kmeans(clustering_subset_2, centers = k, nstart = 25)
# ss <- silhouette(km.res$cluster, dist(clustering_subset_2))
# mean(ss[, 3])
# }
#
# # Compute and plot silhouette scores for k = 2 to k = 10
# k.values <- 2:10
#
# # extract silhouette scores for 2-10 clusters
# sil_values_subset_2 <- map_dbl(k.values, sil_score_subset_2)
#
# plot(k.values, sil_values_subset_2,
# type = "b", pch = 19, frame = FALSE,
# xlab = "Number of clusters k",
# ylab = "Silhouette Score")
fviz_nbclust(clustering_subset_2, kmeans, method = "silhouette")
# if only the temporal variables are left unstandardized
clustering_df_added_temporal_ns<-left_join(clustering_df_added_temporal2[,-c(37:39)], user_num_and_max_sessions,by="child_id")
clustering_subset_ns<-clustering_df_added_temporal_ns %>%
select(avg_n_stories_first3wkly_sessions, unique_sources_first3wks, n_sessions_first3wks, max_days_since_signup, prop_active)
# function to compute silhouette score for k clusters -- takes too long to run, so it is not automatically evaluated; uncomment to evaluate
# sil_score_subset_ns <- function(k) {
# km.res <- kmeans(clustering_subset_ns, centers = k, nstart = 25)
# ss <- silhouette(km.res$cluster, dist(clustering_subset_ns))
# mean(ss[, 3])
# }
#
# # Compute and plot silhouette scores for k = 2 to k = 10
# k.values <- 2:10
#
# # extract silhouette scores for 2-10 clusters
# sil_values_subset_ns <- map_dbl(k.values, sil_score_subset_ns)
#
# plot(k.values, sil_values_subset_ns,
# type = "b", pch = 19, frame = FALSE,
# xlab = "Number of clusters k",
# ylab = "Silhouette Score")
fviz_nbclust(clustering_subset_ns, kmeans, method = "silhouette")
set.seed(2021)
# All variables standardized
k3_subset<-kmeans(clustering_subset, centers = 3, nstart = 25)
set.seed(2021)
# No variables standardized
k3_subset_2<-kmeans(clustering_subset_2, centers = 3, nstart = 25)
set.seed(2021)
# Only temporal variables not standardized
k3_subset_ns<-kmeans(clustering_subset_ns, centers = 3, nstart = 25)
fviz_cluster(k3_subset, clustering_subset)
fviz_cluster(k3_subset_2, clustering_subset_2)
fviz_cluster(k3_subset_ns, clustering_subset_ns)
# clustering_features_df_added_with_tenure<-clustering_features_df_added_with_tenure %>%
#   mutate(k3_subset_2_pseudo=case_when(k3_subset_2_cluster==2 ~3,
#                                       k3_subset_2_cluster==3 ~1,
#                                       k3_subset_2_cluster==1 ~3))
# Using week 6 variables
# if we don't standardize anything
clustering_subset_3<-clustering_features_df_added %>%
select(avg_n_stories_first6wkly_sessions, unique_sources_first6wks, n_sessions_first6wks, max_days_since_signup, prop_active)
clustering_subset_3<-clustering_subset_3[,-1] # drop child_id, which select() keeps on a grouped tibble
# function to compute silhouette score for k clusters -- takes too long to run, so it is not automatically evaluated; uncomment to evaluate
# sil_score_subset_3 <- function(k) {
# km.res <- kmeans(clustering_subset_3, centers = k, nstart = 25)
# ss <- silhouette(km.res$cluster, dist(clustering_subset_3))
# mean(ss[, 3])
# }
#
# # Compute and plot silhouette scores for k = 2 to k = 10
# k.values <- 2:10
#
# # extract silhouette scores for 2-10 clusters
# sil_values_subset_3 <- map_dbl(k.values, sil_score_subset_3)
#
# plot(k.values, sil_values_subset_3,
# type = "b", pch = 19, frame = FALSE,
# xlab = "Number of clusters k",
# ylab = "Silhouette Score")
fviz_nbclust(clustering_subset_3, kmeans, method = "silhouette")
Even though the k-means model with k=2 gives the highest silhouette score, we use the clusters from the k-means models with k=3 and k=4 to generate user subgroups because having 3 or 4 user clusters could lead to more interesting insights into user subgroups’ behavior on the app.
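To make the silhouette criterion concrete: for each point i, the score is (b - a) / max(a, b), where a is the mean distance from i to the other points in its own cluster and b is the smallest mean distance from i to the points of any other cluster. Scores near 1 indicate tight, well-separated clusters. A base-R sketch on a tiny hypothetical 1-D dataset with two obvious clusters:

```r
x  <- c(0, 1, 10, 11)      # two well-separated groups
cl <- c(1, 1, 2, 2)        # cluster assignments
dm <- as.matrix(dist(x))
sil <- sapply(seq_along(x), function(i) {
  a <- mean(dm[i, cl == cl[i] & seq_along(x) != i])  # within-cluster distance
  b <- min(sapply(setdiff(unique(cl), cl[i]),        # nearest other cluster
                  function(k) mean(dm[i, cl == k])))
  (b - a) / max(a, b)
})
print(mean(sil))   # close to 1: the two-cluster structure is clear
```

cluster::silhouette (used in the commented-out code above) and fviz_nbclust compute exactly this quantity at scale.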
We begin by estimating the average covariate (i.e. feature) value for each user cluster. This allows us to analyze what differentiates users who are assigned to different clusters. We also compare the average of the covariates used for clustering across the k-means clusters with the average of the same covariates across user groups obtained by manually splitting users into 3 groups based on their tenure on the app. This comparison allows us to understand how the k-means clusters compare against a different type of user subgroup that we have already analyzed.
Thus, we perform 3 different types of user subgroup analysis and comparisons. (For example, for the k-means model with k=3, we compare the users assigned to Cluster 1 with users assigned to Cluster 3.)
First, we define the user subgroups. Then we define the plot_covariate_means_by_ntile function and use it to estimate the average covariate value for each user subgroup.
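A note on how plot_covariate_means_by_ntile (defined below) estimates these averages: it regresses each covariate on the subgroup indicators without an intercept, so the fitted coefficients are exactly the per-subgroup means, and lm_robust supplies robust standard errors alongside them. A base-R sketch of the idea with hypothetical data (plain lm in place of estimatr::lm_robust):

```r
set.seed(2021)
# Hypothetical covariate y for 60 users in 3 subgroups with different means
df <- data.frame(g = factor(rep(1:3, each = 20)),
                 y = rnorm(60, mean = rep(c(1, 5, 9), each = 20)))
# No intercept: one coefficient per subgroup
fit <- lm(y ~ 0 + g, data = df)
print(coef(fit))                 # per-subgroup means
print(tapply(df$y, df$g, mean))  # identical values
```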
# Perform k-means clustering with k=3
set.seed(2021)
k3 <- kmeans(clustering_df, centers = 3, nstart = 25)
# Add the k-means model label to each row corresp. to a user id
clustering_features_df <- clustering_features_df %>%
ungroup() %>%
mutate(k3_cluster = k3$cluster)
# Perform k-means clustering with k=4
set.seed(2021)
k4 <- kmeans(clustering_df, centers = 4, nstart = 25)
# Add the k-means model label to each row corresp. to a user id
clustering_features_df <- clustering_features_df %>%
ungroup() %>%
mutate(k4_cluster = k4$cluster)
# Get the highest number of days each users has been on the app
user_max_usage_sessions <- filtered_logged_df %>%
select(days_since_signup, child_id) %>%
group_by(child_id) %>%
dplyr::summarize(max_days_since_signup = max(days_since_signup))
# Join the users' maximum days on the app and tenure type with the clustering features dataframe
clustering_joined_df <- left_join(x = clustering_features_df, y = user_max_usage_sessions %>%
select(child_id, max_days_since_signup),
by = 'child_id')
# Add the tenure type to each row corresp. to a user id
clustering_joined_df$tenure_type <- ifelse(clustering_joined_df$max_days_since_signup<90, "Short-Term",
ifelse(clustering_joined_df$max_days_since_signup>=90 &
clustering_joined_df$max_days_since_signup<180, "Medium-Term",
"Long-Term"))
clustering_features_df_added_with_tenure<-left_join(x=clustering_features_df_added, y=clustering_joined_df %>%
select(child_id, tenure_type),
by='child_id')
# Add the tenure type to each row corresp. to a user id
user_num_and_max_sessions$tenure_type <- ifelse(user_num_and_max_sessions$max_days_since_signup<90, "Short-Term",
ifelse(user_num_and_max_sessions$max_days_since_signup>=90 &
user_num_and_max_sessions$max_days_since_signup<180, "Medium-Term",
"Long-Term"))
user_num_and_max_sessions<-left_join(x=clustering_features_df, user_num_and_max_sessions, by="child_id") %>%
select(child_id, max_days_since_signup, num_days_active, prop_active, tenure_type)
tenure_df<-user_num_and_max_sessions %>%
dplyr::group_by(tenure_type) %>%
dplyr::summarize(number=n(), sum_prop_active=sum(prop_active),
avg_prop_active=mean(prop_active), st_dev_avg=sd(prop_active))
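The nested ifelse calls above map max_days_since_signup into the three tenure buckets. Base R's cut() expresses the same half-open intervals ([0, 90), [90, 180), [180, Inf)) more compactly, which is handy if the thresholds ever change. A small sketch with hypothetical values:

```r
max_days <- c(10, 90, 179, 400)   # hypothetical max_days_since_signup values
tenure <- cut(max_days,
              breaks = c(-Inf, 90, 180, Inf),
              labels = c("Short-Term", "Medium-Term", "Long-Term"),
              right = FALSE)      # right = FALSE gives [a, b) intervals
print(as.character(tenure))
```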
# Standardizing all variables
clustering_features_df_added_with_tenure<-clustering_features_df_added_with_tenure %>%
ungroup() %>%
mutate(k3_subset_cluster=k3_subset$cluster)
# No variables standardized
clustering_features_df_added_with_tenure<-clustering_features_df_added_with_tenure %>%
ungroup() %>%
mutate(k3_subset_2_cluster=k3_subset_2$cluster)
# Only temporal variables not standardized
clustering_features_df_added_with_tenure<-clustering_features_df_added_with_tenure %>%
ungroup() %>%
mutate(k3_subset_ns_cluster=k3_subset_ns$cluster)
# Uncomment to generate plot from here
# Create a dataframe with number of users and user type for plotting
# user_type_df_updated <- data.frame(num_users = c(length(short_term_users),
# length(medium_term_users),
# length(tenured_users_updated)),
# user_type = c("Short-term","Medium-term","Long-term"))
#
ggplot(tenure_df, aes(x = tenure_type, y = number)) +
geom_bar(stat = "identity", fill = 'darkblue') +
geom_text(aes(label=number), vjust=-0.2)+
xlab("User Tenure Type") +
ylab("Number of Users") +
theme_bw() +
theme(axis.text.x=element_text(hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
scale_y_continuous(breaks = seq(0,14000,2000))
# ggsave("tenure.jpg")
# Table of covariate means/sd by n.tile. The n.tile variable defines the
# subgroups here and must be created before running the function.
plot_covariate_means_by_ntile <- function(.df, .ntile = "ntile", covariate_names, n_top = 10, title_label) {
.df <- as.data.frame(.df)
covariate_names <- covariate_names
.df[, .ntile] <- as.factor(.df[, .ntile])
# Regress each covariate on the ntile/subgroup indicators to estimate per-group means
cov_means <- lapply(covariate_names, function(covariate) {
lm_robust(as.formula(paste0(covariate, " ~ 0 + ", .ntile)), data = .df, se_type = "stata")
})
# Extract the mean and standard deviation of each covariate per ntile/subgroup
cov_table <- lapply(cov_means, function(cov_mean) {
means <- as.data.frame(t(coef(summary(cov_mean))[,c("Estimate", "Std. Error")]))
means
})
# Preparation to color the chart
temp_standardized <- sapply(seq_along(covariate_names), function(j) {
covariate_name <- covariate_names[j]
.mean <- mean(.df[, covariate_name], na.rm = TRUE)
.sd <- sd(.df[, covariate_name], na.rm = TRUE)
m <- as.matrix(round(signif(cov_table[[j]], digits=4), 3))
.standardized <- (m["Estimate",] - .mean) / .sd
.standardized
})
colnames(temp_standardized) <- covariate_names
ordering <- order(apply(temp_standardized, MARGIN = 2, function(x) {.range <- range(x); abs(.range[2] - .range[1])}), decreasing = TRUE)
color_scale <- max(abs(c(max(temp_standardized, na.rm = TRUE), min(temp_standardized, na.rm = TRUE))))
color_scale <- color_scale * c(-1,1)
max_std_dev <- floor(max(color_scale))
breaks <- -max_std_dev:max_std_dev
labels <- c(" ", breaks, " ")
breaks <- c(min(color_scale), breaks, max(color_scale))
# Little trick to display the standard errors
table <- lapply(seq_along(covariate_names), function(j) {
covariate_name <- covariate_names[j]
.mean <- mean(.df[, covariate_name], na.rm = TRUE)
.sd <- sd(.df[, covariate_name], na.rm = TRUE)
m <- as.matrix(round(signif(cov_table[[j]], digits=4), 3))
.standardized <- (m["Estimate",] - .mean) / .sd
return(data.frame(covariate = covariate_name,
group = 1:ncol(m),
estimate = m["Estimate",], std.error = m["Std. Error",],
standardized = .standardized))
})
# table <- do.call(rbind, table)
table <- rbindlist(table) # rbindlist() and setnames() are from the data.table package
setnames(table, "group", .ntile)
table[, covariate := factor(covariate, levels = rev(covariate_names[ordering]), ordered = TRUE)]
table[covariate %in% head(covariate_names[ordering], n_top)] %>%
mutate(info = paste0(estimate, "\n(", std.error, ")")) %>%
ggplot(aes_string(x = .ntile, y = "covariate")) +
# Add coloring
geom_raster(aes(fill = standardized)
, alpha = 0.9
) +
scale_fill_distiller(palette = "RdBu",
direction = 1,
breaks = breaks,
labels = labels,
limits = color_scale,
name = "Standard\nDeviation on\nNormalized\nDistribution"
) +
# add numerics
geom_text(aes(label = info), size=2.1) +
# reformat
labs(title = paste0("Covariate averages within ", title_label),
y = "within covariate") +
scale_x_continuous(position = "top") #+
#cowplot::theme_minimal_hgrid(16)
}
# For k-means clusters when k=3
avg_covars_k3_clustering_df <- clustering_joined_df %>%
select(-c(k4_cluster, child_id, tenure_type))
avg_covars_k3_clustering_df$max_days_since_signup <- as.integer(avg_covars_k3_clustering_df$max_days_since_signup)
k3_clusters_covariate_names <- colnames(avg_covars_k3_clustering_df)
k3_clusters_covariate_names <- k3_clusters_covariate_names[k3_clusters_covariate_names != "k3_cluster"]
plot_covariate_means_by_ntile(avg_covars_k3_clustering_df,
.ntile = "k3_cluster",
k3_clusters_covariate_names,
n_top = 20, # <- this can be changed to any number
title_label = "k-means clusters (k=3)"
)
# For k-means clusters when k=4
avg_covars_k4_clustering_df <- clustering_joined_df %>%
select(-c(k3_cluster, child_id, tenure_type))
avg_covars_k4_clustering_df$max_days_since_signup <- as.integer(avg_covars_k4_clustering_df$max_days_since_signup)
k4_clusters_covariate_names <- colnames(avg_covars_k4_clustering_df)
k4_clusters_covariate_names <- k4_clusters_covariate_names[k4_clusters_covariate_names != "k4_cluster"]
plot_covariate_means_by_ntile(avg_covars_k4_clustering_df,
.ntile = "k4_cluster",
k4_clusters_covariate_names,
n_top = 20, # <- this can be changed to any number
title_label = "k-means clusters (k=4)"
)
# For user cohorts by tenure type
avg_covars_tenure_df <- clustering_joined_df %>%
select(-c(k3_cluster, child_id, k4_cluster))
avg_covars_tenure_df$max_days_since_signup <- as.integer(avg_covars_tenure_df$max_days_since_signup)
tenure_covariate_names <- colnames(avg_covars_tenure_df)
tenure_covariate_names <- tenure_covariate_names[tenure_covariate_names != "tenure_type"]
plot_covariate_means_by_ntile(avg_covars_tenure_df,
.ntile = "tenure_type",
tenure_covariate_names,
n_top = 20, # <- this can be changed to any number
title_label = "subgroups by tenure type"
)
The heatmap above displays the covariate averages for each user subgroup. The colors indicate departure from the sample mean: blue indicates an average covariate value above the sample mean and red indicates one below it. In addition, the covariates on the y-axis are ordered by their variation across groups. In other words, for the k-means clusters with k=3, the avg_n_stories_first2wkly_sessions variable has the most variation across user clusters (i.e. Clusters 1-3), so it is displayed first.
For the k-means clusters with k=3, Cluster 2 is the cluster with the highest values across the majority of covariates, even after taking into account the standard error of the covariates. This implies that users in this cluster are the ones that are the most engaged with the app (at least during the first few weeks). Cluster 3 is the cluster with the second highest values across the majority of covariates and Cluster 1 is the cluster with the lowest values across the majority of covariates.
For the k-means clusters with k=4, Cluster 2 is the cluster with the highest values across the majority of covariates, even after taking into account the standard error of the covariates. Cluster 1 is the cluster with the second highest values across the majority of covariates. Cluster 4 is the cluster with the second lowest values across the majority of covariates. Cluster 3 is the cluster with the lowest values across the majority of covariates.
Furthermore, the k-means user clusters show large differences in average covariate values across clusters, whereas the user cohorts by tenure type show substantially smaller differences across cohorts. Intuitively, this makes sense: we used a variety of initial behavioral characteristics to generate the k-means clusters, whereas the tenure-based cohorts rely on a single fixed characteristic (i.e. the maximum number of days on the app since signup) without considering any behavioral characteristics.
Do you observe any other interesting patterns or trends in the heatmap above?
# For k-means clusters with k3_subset
avg_covars_k3_subset_clustering_df <- clustering_features_df_added_with_tenure %>%
select(-c(k3_subset_2_cluster, k3_subset_ns_cluster, child_id, tenure_type))
avg_covars_k3_subset_clustering_df$max_days_since_signup <- as.integer(avg_covars_k3_subset_clustering_df$max_days_since_signup)
k3_subset_clusters_covariate_names <- colnames(avg_covars_k3_subset_clustering_df)
k3_subset_clusters_covariate_names <- k3_subset_clusters_covariate_names[k3_subset_clusters_covariate_names != "k3_subset_cluster"]
plot_covariate_means_by_ntile(avg_covars_k3_subset_clustering_df,
.ntile = "k3_subset_cluster",
k3_subset_clusters_covariate_names,
n_top = 20, # <- this can be changed to any number
title_label = "k-means clusters (k=3) k3_subsets"
)
# For k-means clusters with k3_subset_2
avg_covars_k3_subset_2_clustering_df <- clustering_features_df_added_with_tenure %>%
select(-c(k3_subset_cluster, k3_subset_ns_cluster, child_id, tenure_type))
avg_covars_k3_subset_2_clustering_df$max_days_since_signup <-
as.integer(avg_covars_k3_subset_2_clustering_df$max_days_since_signup)
k3_subset_2_clusters_covariate_names <- colnames(avg_covars_k3_subset_2_clustering_df)
k3_subset_2_clusters_covariate_names <- k3_subset_2_clusters_covariate_names[k3_subset_2_clusters_covariate_names !=
"k3_subset_2_cluster"]
plot_covariate_means_by_ntile(avg_covars_k3_subset_2_clustering_df,
.ntile = "k3_subset_2_cluster",
k3_subset_2_clusters_covariate_names,
n_top = 20, # <- this can be changed to any number
title_label = "k-means clusters (k=3) k3_subsets_2"
)
# For k-means clusters with k3_subset_ns
avg_covars_k3_subset_ns_clustering_df <- clustering_features_df_added_with_tenure %>%
select(-c(k3_subset_cluster, k3_subset_2_cluster, child_id, tenure_type))
avg_covars_k3_subset_ns_clustering_df$max_days_since_signup <-
as.integer(avg_covars_k3_subset_ns_clustering_df$max_days_since_signup)
k3_subset_ns_clusters_covariate_names <- colnames(avg_covars_k3_subset_ns_clustering_df)
k3_subset_ns_clusters_covariate_names <- k3_subset_ns_clusters_covariate_names[k3_subset_ns_clusters_covariate_names !=
"k3_subset_ns_cluster"]
plot_covariate_means_by_ntile(avg_covars_k3_subset_ns_clustering_df,
.ntile = "k3_subset_ns_cluster",
k3_subset_ns_clusters_covariate_names,
n_top = 20, # <- this can be changed to any number
title_label = "k-means clusters (k=3) k3_subsets_ns"
)
# For user cohorts by tenure type
avg_covars_tenure_df <- clustering_features_df_added_with_tenure %>%
select(-c(k3_subset_cluster, k3_subset_2_cluster, k3_subset_ns_cluster, child_id))
avg_covars_tenure_df$max_days_since_signup <- as.integer(avg_covars_tenure_df$max_days_since_signup)
tenure_covariate_names <- colnames(avg_covars_tenure_df)
tenure_covariate_names <- tenure_covariate_names[tenure_covariate_names != "tenure_type"]
plot_covariate_means_by_ntile(avg_covars_tenure_df,
.ntile = "tenure_type",
tenure_covariate_names,
n_top =20, # <- this can be changed to any number
title_label = "subgroups by tenure type"
)
final_clustering_feats<-left_join(x=clustering_features_df_added_with_tenure, y=filtered_child_df %>%
select(child_id, freadom_point),
by='child_id')
avg_freadom_points<-final_clustering_feats %>%
dplyr::group_by(k3_subset_cluster) %>%
dplyr::summarize(avg_freadom_points=mean(freadom_point))
interactions_clustering<-left_join(x=filtered_logged_df, y=final_clustering_feats %>%
select(child_id, k3_subset_cluster),
by='child_id')
avg_stories_per_user<-interactions_clustering %>%
dplyr::group_by(k3_subset_cluster) %>%
dplyr::summarize(num_users=n_distinct(child_id), num_stories=n_distinct(story_id),avg_stories=num_stories/num_users)
avg_stories_per_user_updated<-interactions_clustering %>%
dplyr::group_by(k3_subset_cluster, child_id) %>%
dplyr::summarize(num_stories=n_distinct(story_id))
avg_stories_per_user_updated<-avg_stories_per_user_updated %>%
dplyr::group_by(k3_subset_cluster) %>%
dplyr::summarize(num_users=n(), sum_stories=sum(num_stories), avg_stories_p_user=mean(num_stories),
min_stories=min(num_stories),
max_stories=max(num_stories))
Now let's analyze the fixed characteristics of users in each cluster and compare these characteristics to those of the users in the cohorts created by tenure type. Note that the assignment of users to clusters is independent of their fixed characteristics (at least directly), because we did not use these characteristics to build the features for the k-means models.
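One way to quantify whether cluster membership is associated with a fixed characteristic is a chi-squared test of independence on the cluster-by-characteristic contingency table. A base-R sketch with hypothetical counts (the numbers are made up for illustration):

```r
# Rows: three clusters; columns: a two-level fixed characteristic
tab <- matrix(c(200, 100,
                180, 120,
                 50, 150),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = 1:3, trait = c("A", "B")))
res <- chisq.test(tab)
print(res$p.value)   # tiny p-value: trait distribution differs across clusters
```

With real data, table(cluster_label, characteristic) builds the table directly; a small p-value indicates the characteristic's distribution is not the same in every cluster.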
# Get the ids of users assigned to k-means clusters
clustered_users <- clustering_features_df %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters <- left_join(x = filtered_child_df, y = clustering_features_df %>%
select(child_id, k3_cluster, k4_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters <- left_join(x = filtered_child_df_w_clusters, y = clustering_joined_df %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users)
# For k=3
# Get number of users by acquisition method in each k-means cluster
filtered_child_df_w_k3_clusters <- filtered_child_df_w_clusters %>%
select(k3_cluster, user) %>%
group_by(k3_cluster, user) %>%
dplyr::summarize(user_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by acquisition method in each cluster
filtered_child_df_w_k3_clusters_total <- filtered_child_df_w_k3_clusters %>%
group_by(user) %>%
dplyr::summarize(user_k3_cluster_total = sum(user_k3_cluster))
filtered_child_df_w_k3_clusters <- left_join(x = filtered_child_df_w_k3_clusters, y = filtered_child_df_w_k3_clusters_total %>%
select(user, user_k3_cluster_total),
by = 'user')
filtered_child_df_w_k3_clusters$user_k3_cluster_prop = filtered_child_df_w_k3_clusters$user_k3_cluster/filtered_child_df_w_k3_clusters$user_k3_cluster_total
# Get cluster labels of users for k-means model with k=3
filtered_child_df_w_k3_clusters$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters$k3_cluster)
# For k=4
# Get number of users by acquisition method in each k-means cluster
filtered_child_df_w_k4_clusters <- filtered_child_df_w_clusters %>%
select(k4_cluster, user) %>%
group_by(k4_cluster, user) %>%
dplyr::summarize(user_k4_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by acquisition method in each cluster
filtered_child_df_w_k4_clusters_total <- filtered_child_df_w_k4_clusters %>%
group_by(user) %>%
dplyr::summarize(user_k4_cluster_total = sum(user_k4_cluster))
filtered_child_df_w_k4_clusters <- left_join(x = filtered_child_df_w_k4_clusters, y = filtered_child_df_w_k4_clusters_total %>%
select(user, user_k4_cluster_total),
by = 'user')
filtered_child_df_w_k4_clusters$user_k4_cluster_prop = filtered_child_df_w_k4_clusters$user_k4_cluster/filtered_child_df_w_k4_clusters$user_k4_cluster_total
# Get cluster labels of users for k-means model with k=4
filtered_child_df_w_k4_clusters$k4_cluster_label <- factor(filtered_child_df_w_k4_clusters$k4_cluster)
# Get number of users by acquisition method in each user cohort created by tenure type
filtered_child_df_w_tenures <- filtered_child_df_w_clusters %>%
select(tenure_type, user) %>%
group_by(tenure_type, user) %>%
dplyr::summarize(user_tenure_type = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by acquisition method in each cluster
filtered_child_df_w_tenures_total <- filtered_child_df_w_tenures %>%
group_by(user) %>%
dplyr::summarize(user_tenure_type_total = sum(user_tenure_type))
filtered_child_df_w_tenures <- left_join(x = filtered_child_df_w_tenures, y = filtered_child_df_w_tenures_total %>%
select(user, user_tenure_type_total),
by = 'user')
filtered_child_df_w_tenures$user_tenure_type_prop = filtered_child_df_w_tenures$user_tenure_type/filtered_child_df_w_tenures$user_tenure_type_total
filtered_child_df_w_tenures$tenure_type <- factor(filtered_child_df_w_tenures$tenure_type)
# Plot the user acquisition method for each cluster of k-means clustering
# For k=3
p1 <- ggplot(filtered_child_df_w_k3_clusters,
aes(user, user_k3_cluster_prop, fill = k3_cluster_label)) +
geom_col(position = "dodge") +
xlab("User Acquisition Method") +
ylab("Proportion of Users") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
ggtitle("K-means Clusters (k=3)")
# For k=4
p2 <- ggplot(filtered_child_df_w_k4_clusters,
aes(user, user_k4_cluster_prop, fill = k4_cluster_label)) +
geom_col(position = "dodge") +
xlab("User Acquisition Method") +
ylab("Proportion of Users") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
ggtitle("K-means Clusters (k=4)")
# For users by tenure type
p3 <- ggplot(filtered_child_df_w_tenures,
aes(user, user_tenure_type_prop, fill = tenure_type)) +
geom_col(position = "dodge") +
xlab("User Acquisition Method") +
ylab("Proportion of Users") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
ggtitle("Tenure Cohorts")
# grid.arrange() is from the gridExtra package
grid.arrange(p1, p2, p3, nrow = 2)
The two plots in the top row represent the proportion of users by acquisition method in each cluster. These user clusters are generated by the k-means models with k=3 and k=4, respectively.
The plot in the second row represents the proportion of users by acquisition method in each cohort. These user cohorts are generated by classifying users into one of the 3 tenure types (“short-term”, “medium-term”, and “long-term”) based on their tenure on the app (i.e. the maximum days active on the app since signup).
Interestingly, the proportion of users by acquisition method is very similar across clusters for the k-means clusters, both when k=3 and k=4. As for the cohorts by tenure type, the proportions of b2b users who are classified as medium- and long-term users are higher than the proportions of users by other acquisition methods, which is what we observed in the previous tutorials as well.
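As an aside, the join-and-divide pipeline used above to turn counts into within-method proportions can be written in base R as a single prop.table call on a contingency table, which makes a useful cross-check. A sketch with hypothetical labels:

```r
# Hypothetical acquisition methods and cluster assignments for six users
method  <- c("b2b", "b2b", "organic", "organic", "organic", "b2b")
cluster <- c(1, 2, 1, 1, 2, 1)
counts  <- table(method, cluster)
# margin = 1 normalizes within each acquisition method (rows sum to 1)
prop_by_method <- prop.table(counts, margin = 1)
print(prop_by_method)
```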
# Get the ids of users assigned to k-means clusters
clustered_users_updated <- clustering_features_df_added_with_tenure %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df, y = clustering_features_df_added_with_tenure %>%
select(child_id, k3_subset_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df_w_clusters_updated, y = clustering_features_df_added_with_tenure %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# For k=3
# Get number of users by acquisition method in each k-means cluster
filtered_child_df_w_k3_clusters_updated <- filtered_child_df_w_clusters_updated %>%
select(k3_subset_cluster, user) %>%
group_by(k3_subset_cluster, user) %>%
dplyr::summarize(user_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by acquisition method in each cluster
filtered_child_df_w_k3_clusters_total_updated <- filtered_child_df_w_k3_clusters_updated %>%
group_by(user) %>%
dplyr::summarize(user_k3_cluster_total = sum(user_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = filtered_child_df_w_k3_clusters_total_updated %>%
select(user, user_k3_cluster_total),
by = 'user')
filtered_child_df_w_k3_clusters_updated$user_k3_cluster_prop = filtered_child_df_w_k3_clusters_updated$user_k3_cluster/filtered_child_df_w_k3_clusters_updated$user_k3_cluster_total
# Get cluster labels of users for k-means model with k=3
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$k3_subset_cluster)
# Plot the user acquisition method for each cluster of k-means clustering
# For k=3
ggplot(filtered_child_df_w_k3_clusters_updated,
aes(user, user_k3_cluster_prop, fill = k3_cluster_label)) +
geom_col(position = "dodge") +
xlab("User Acquisition Method") +
ylab("Proportion of Users") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_discrete(name="Usage categories", labels = c("Superusers", "Stragglers", "Middle-users"))+
ggtitle("K-means Clusters (k=3)")
# ggsave("user_acquisition.jpg")
Perform similar analysis using other fixed characteristics for user subgroups. Are there any interesting patterns or trends?
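For the exercise above, a small helper function makes it quick to repeat the proportion computation for any fixed characteristic. This is an illustrative base-R sketch on a toy dataframe; the column names `cluster` and `gender` are hypothetical stand-ins for the tutorial's variables.

```r
# Sketch: for any categorical user characteristic, compute the share of
# users in each k-means cluster, using base R only.
prop_by_characteristic <- function(df, cluster_col, char_col) {
  counts <- table(df[[char_col]], df[[cluster_col]])
  # Divide each count by its row total: within each characteristic level,
  # what proportion of users falls into each cluster?
  prop.table(counts, margin = 1)
}

# Tiny synthetic example (hypothetical columns)
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 1),
  gender  = c("f", "m", "f", "f", "m", "f")
)
props <- prop_by_characteristic(toy, "cluster", "gender")
```

Passing a different characteristic column is then a one-line change, which makes it easy to scan several user attributes for interesting cluster patterns.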
# Join the users' k-means cluster labels with the user-story interactions dataframe
clustered_users_updated <- clustering_features_df_added_with_tenure %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df, y = clustering_features_df_added_with_tenure %>%
select(child_id, k3_subset_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df_w_clusters_updated, y = clustering_features_df_added_with_tenure %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# For k=3
# Get number of users by grade in each k-means cluster
filtered_child_df_w_k3_clusters_updated <- filtered_child_df_w_clusters_updated %>%
select(k3_subset_cluster, grade) %>%
group_by(k3_subset_cluster, grade) %>%
dplyr::summarize(grade_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by grade in each cluster
filtered_child_df_w_k3_clusters_total_updated <- filtered_child_df_w_k3_clusters_updated %>%
group_by(grade) %>%
dplyr::summarize(grade_k3_cluster_total = sum(grade_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = filtered_child_df_w_k3_clusters_total_updated %>%
select(grade, grade_k3_cluster_total),
by = 'grade')
filtered_child_df_w_k3_clusters_updated$grade_k3_cluster_prop = filtered_child_df_w_k3_clusters_updated$grade_k3_cluster/filtered_child_df_w_k3_clusters_updated$grade_k3_cluster_total
# Get cluster labels of users for k-means model with k=3
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$k3_subset_cluster)
# Define a custom x-axis ordering for the 9 grade levels
# (note: cbind recycles plot_order, so this assumes the rows cycle through the 9 grades)
plot_order <- c(4, 5, 6, 7, 8, 9, 2, 1, 3)
filtered_child_df_w_k3_clusters_updated <- cbind(filtered_child_df_w_k3_clusters_updated, plot_order)
# Plot the grade for each cluster of k-means clustering
# For k=3
ggplot(filtered_child_df_w_k3_clusters_updated,
aes(reorder(grade, plot_order), grade_k3_cluster_prop, fill = k3_cluster_label)) +
geom_col(position = "dodge") +
xlab("Grade") +
ylab("Proportion of Users") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_discrete(name="Usage categories", labels = c("Superusers", "Stragglers", "Middle-users"))+
ggtitle("K-means Clusters (k=3)")
# ggsave("grade.jpg")
# Join the users' k-means cluster labels with the user-story interactions dataframe
clustered_users_updated <- clustering_features_df_added_with_tenure %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df, y = clustering_features_df_added_with_tenure %>%
select(child_id, k3_subset_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df_w_clusters_updated, y = clustering_features_df_added_with_tenure %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# For k=3
# Get number of users by reading level in each k-means cluster
filtered_child_df_w_k3_clusters_updated <- filtered_child_df_w_clusters_updated %>%
select(k3_subset_cluster, grade_level) %>%
group_by(k3_subset_cluster, grade_level) %>%
dplyr::summarize(grade_level_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by reading level in each cluster
filtered_child_df_w_k3_clusters_total_updated <- filtered_child_df_w_k3_clusters_updated %>%
group_by(grade_level) %>%
dplyr::summarize(grade_level_k3_cluster_total = sum(grade_level_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = filtered_child_df_w_k3_clusters_total_updated %>%
select(grade_level, grade_level_k3_cluster_total),
by = 'grade_level')
filtered_child_df_w_k3_clusters_updated$grade_level_k3_cluster_prop = filtered_child_df_w_k3_clusters_updated$grade_level_k3_cluster/filtered_child_df_w_k3_clusters_updated$grade_level_k3_cluster_total
# Get cluster labels of users for k-means model with k=3
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$k3_subset_cluster)
# Plot the user reading level for each cluster of k-means clustering
# For k=3
ggplot(filtered_child_df_w_k3_clusters_updated,
aes(grade_level, grade_level_k3_cluster_prop, fill = k3_cluster_label)) +
geom_col(position = "dodge") +
xlab("Reading Level on the App") +
ylab("Proportion of Users") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_discrete(name="Usage categories", labels = c("Superusers", "Stragglers", "Middle-users"))+
ggtitle("K-means Clusters (k=3)")
# ggsave("grade_level.jpg")
# Join the users' k-means cluster labels with the user-story interactions dataframe
clustered_users_updated <- clustering_features_df_added_with_tenure %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df, y = clustering_features_df_added_with_tenure %>%
select(child_id, k3_subset_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df_w_clusters_updated, y = clustering_features_df_added_with_tenure %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# For k=3
# Get number of users by acquisition method in each k-means cluster
filtered_child_df_w_k3_clusters_updated <- filtered_child_df_w_clusters_updated %>%
select(k3_subset_cluster, user) %>%
group_by(k3_subset_cluster, user) %>%
dplyr::summarize(user_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by acquisition method in each cluster
filtered_child_df_w_k3_clusters_total_updated <- filtered_child_df_w_k3_clusters_updated %>%
group_by(user) %>%
dplyr::summarize(user_k3_cluster_total = sum(user_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = filtered_child_df_w_k3_clusters_total_updated %>%
select(user, user_k3_cluster_total),
by = 'user')
# Proportion by cluster
clusters_total<-filtered_child_df_w_k3_clusters_updated %>%
dplyr::group_by(k3_subset_cluster) %>%
dplyr::summarize(cluster_total=sum(user_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = clusters_total,
by = 'k3_subset_cluster')
filtered_child_df_w_k3_clusters_updated$k3_cluster_prop = filtered_child_df_w_k3_clusters_updated$user_k3_cluster/filtered_child_df_w_k3_clusters_updated$cluster_total
# Get cluster labels of users for k-means model with k=3
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$k3_subset_cluster)
filtered_child_df_w_k3_clusters_updated<-filtered_child_df_w_k3_clusters_updated %>%
mutate(Usage_Group=case_when(k3_subset_cluster==1 ~ "Superusers",
k3_subset_cluster==2 ~ "Stragglers",
k3_subset_cluster==3 ~ "Middle-users"))
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$Usage_Group)
# Plot the user acquisition method for each cluster of k-means clustering
# For k=3
ggplot(filtered_child_df_w_k3_clusters_updated,
aes(Usage_Group, k3_cluster_prop, fill = user)) +
geom_col(position = "dodge") +
xlab("Cluster") +
ylab("Proportion of Cluster") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
#scale_fill_discrete(name="Usage categories", labels = c("Superusers", "Stragglers", "Middle-users"))+
ggtitle("K-means Clusters (k=3)")+
scale_x_discrete(limits = c("Superusers", "Middle-users", "Stragglers"))
# ggsave("user_acquisition_2.jpg")
# Join the users' k-means cluster labels with the user-story interactions dataframe
clustered_users_updated <- clustering_features_df_added_with_tenure %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df, y = clustering_features_df_added_with_tenure %>%
select(child_id, k3_subset_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df_w_clusters_updated, y = clustering_features_df_added_with_tenure %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# For k=3
# Get number of users by grade in each k-means cluster
filtered_child_df_w_k3_clusters_updated <- filtered_child_df_w_clusters_updated %>%
select(k3_subset_cluster, grade) %>%
group_by(k3_subset_cluster, grade) %>%
dplyr::summarize(grade_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by grade in each cluster
filtered_child_df_w_k3_clusters_total_updated <- filtered_child_df_w_k3_clusters_updated %>%
group_by(grade) %>%
dplyr::summarize(grade_k3_cluster_total = sum(grade_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = filtered_child_df_w_k3_clusters_total_updated %>%
select(grade, grade_k3_cluster_total),
by = 'grade')
# Proportion by cluster
clusters_total<-filtered_child_df_w_k3_clusters_updated %>%
dplyr::group_by(k3_subset_cluster) %>%
dplyr::summarize(cluster_total=sum(grade_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = clusters_total,
by = 'k3_subset_cluster')
filtered_child_df_w_k3_clusters_updated$grade_k3_cluster_prop = filtered_child_df_w_k3_clusters_updated$grade_k3_cluster/filtered_child_df_w_k3_clusters_updated$cluster_total
filtered_child_df_w_k3_clusters_updated<-filtered_child_df_w_k3_clusters_updated %>%
mutate(Usage_Group=case_when(k3_subset_cluster==1 ~ "Superusers",
k3_subset_cluster==2 ~ "Stragglers",
k3_subset_cluster==3 ~ "Middle-users"))
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$Usage_Group)
# Define a custom x-axis ordering for the 9 grade levels
plot_order <- c(4, 5, 6, 7, 8, 9, 2, 1, 3)
filtered_child_df_w_k3_clusters_updated <- cbind(filtered_child_df_w_k3_clusters_updated, plot_order)
# Plot the grade for each cluster of k-means clustering
# For k=3
ggplot(filtered_child_df_w_k3_clusters_updated,
aes(Usage_Group, grade_k3_cluster_prop, fill = reorder(grade,plot_order) )) +
geom_col(position = "dodge") +
xlab("Cluster") +
ylab("Proportion of Cluster") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_discrete(name="grade")+
ggtitle("K-means Clusters (k=3)")+
scale_x_discrete(limits = c("Superusers", "Middle-users", "Stragglers"))
# ggsave("grade_2.jpg")
# Join the users' k-means cluster labels with the user-story interactions dataframe
clustered_users_updated <- clustering_features_df_added_with_tenure %>%
pull(child_id)
# Get cluster labels for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df, y = clustering_features_df_added_with_tenure %>%
select(child_id, k3_subset_cluster),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# Get tenure type for users across all clusters
filtered_child_df_w_clusters_updated <- left_join(x = filtered_child_df_w_clusters_updated, y = clustering_features_df_added_with_tenure %>%
select(child_id, tenure_type),
by = 'child_id') %>%
dplyr::filter(child_id %in% clustered_users_updated)
# For k=3
# Get number of users by reading level in each k-means cluster
filtered_child_df_w_k3_clusters_updated <- filtered_child_df_w_clusters_updated %>%
select(k3_subset_cluster, grade_level) %>%
group_by(k3_subset_cluster, grade_level) %>%
dplyr::summarize(grade_level_k3_cluster = n()) %>%
ungroup %>%
drop_na()
# Get proportion of users by reading level in each cluster
filtered_child_df_w_k3_clusters_total_updated <- filtered_child_df_w_k3_clusters_updated %>%
group_by(grade_level) %>%
dplyr::summarize(grade_level_k3_cluster_total = sum(grade_level_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = filtered_child_df_w_k3_clusters_total_updated %>%
select(grade_level, grade_level_k3_cluster_total),
by = 'grade_level')
# Proportion by cluster
clusters_total<-filtered_child_df_w_k3_clusters_updated %>%
dplyr::group_by(k3_subset_cluster) %>%
dplyr::summarize(cluster_total=sum(grade_level_k3_cluster))
filtered_child_df_w_k3_clusters_updated <- left_join(x = filtered_child_df_w_k3_clusters_updated, y = clusters_total,
by = 'k3_subset_cluster')
filtered_child_df_w_k3_clusters_updated$grade_level_k3_cluster_prop = filtered_child_df_w_k3_clusters_updated$grade_level_k3_cluster/filtered_child_df_w_k3_clusters_updated$cluster_total
filtered_child_df_w_k3_clusters_updated<-filtered_child_df_w_k3_clusters_updated %>%
mutate(Usage_Group=case_when(k3_subset_cluster==1 ~ "Superusers",
k3_subset_cluster==2 ~ "Stragglers",
k3_subset_cluster==3 ~ "Middle-users"))
filtered_child_df_w_k3_clusters_updated$k3_cluster_label <- factor(filtered_child_df_w_k3_clusters_updated$Usage_Group)
# Plot the reading level for each cluster of k-means clustering
# For k=3
ggplot(filtered_child_df_w_k3_clusters_updated,
aes(Usage_Group, grade_level_k3_cluster_prop, fill = grade_level)) +
geom_col(position = "dodge") +
xlab("Cluster") +
ylab("Proportion of Cluster") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust=0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
#scale_fill_discrete(name="Usage categories", labels = c("Superusers", "Stragglers", "Middle-users"))+
ggtitle("K-means Clusters (k=3)")+
scale_x_discrete(limits = c("Superusers", "Middle-users", "Stragglers"))
# ggsave("grade_level_2.jpg")
Compare the k-means user clusters to other user subgroups in addition to the user cohorts by tenure type. How do they differ? Can you obtain any interesting findings from these user subgroup comparisons?
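One way to make these subgroup comparisons more formal is a chi-squared test of association between cluster membership and a subgroup variable. Below is a sketch on synthetic data; in the real analysis you would tabulate `k3_subset_cluster` (or the `Usage_Group` labels) against a column such as `user` or `grade`.

```r
# Sketch: test whether cluster membership is associated with a user
# subgroup (here, an illustrative acquisition-method variable) via a
# chi-squared test on the cluster-by-subgroup contingency table.
set.seed(1)
toy <- data.frame(
  cluster = sample(c("Superusers", "Middle-users", "Stragglers"), 300, replace = TRUE),
  user    = sample(c("b2b", "b2c", "paid"), 300, replace = TRUE)
)
tab <- table(toy$cluster, toy$user)
test <- chisq.test(tab)
# A small p-value would suggest the subgroup mix differs across clusters;
# with independently sampled toy data we expect a large p-value.
```

This complements the stacked-bar plots above: the plots show where the clusters differ, while the test indicates whether the differences are larger than chance.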
We begin by investigating whether we can predict which users will remain active on the app after the first X weeks based on their fixed and initial behavioral characteristics. Accurately predicting whether a user will remain on the app after a certain time period, particularly from their fixed and initial behavioral characteristics, could help S2M implement targeted strategies (e.g. sending nudges on the app, unlocking unique features, introducing new skills) to retain and re-engage the users who are most likely to churn.
Thus, in this section, we build machine learning models aimed at answering the following question:
Considering users who have at least n sessions in the first m weeks, is it possible to accurately predict who will maintain a minimum level of engagement over x weeks and who will not?
In our models, we let n=6, m=3, and x=8.
We pick m to be 3 (weeks) because the behavioral features in the first 2 weeks are more likely to be noisy, which in turn, would make it challenging for the model to predict accurately.
We pick n to be 6 (sessions) because in the last tutorial, we observed that the average number of active days/week for long-term users in the first 3 weeks is ~2.2 days/week. The intuition for this is that we want to predict retention rates for users who initially exhibit app usage behavior that is similar to long-term users’ initial behavior. Thus, by focusing on this subsample of users, we can potentially provide insight into other important user characteristics that S2M could consider to determine whether a user is likely to stay on the app or exit based on their initial behavior on the app.
We pick x to be 8 (weeks) by taking into account the fact that the user-story interactions data runs from May 1, 2020 to February 1, 2021. If we were to pick a much higher x, we would shrink our sample size non-trivially, since fewer users could be observed for the full window.
Pick different n, m, and x values and investigate the predictive performance of the ML models with these new parameters.
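For this exercise, it helps to wrap the session-count filter (applied in the next subsection) in a function parameterized by n and m. The sketch below is base R and, as a simplification, counts distinct active days as sessions and omits the requirement that a user be observed for the full m weeks; the column names follow the tutorial's dataframes.

```r
# Sketch: the n-sessions-in-first-m-weeks filter, parameterized so that
# alternative (n, m) values are easy to try. Illustrative only.
select_engaged_users <- function(logged_df, n_sessions = 6, m_weeks = 3) {
  cutoff_days <- m_weeks * 7
  # Keep only activity within the first m weeks of each user's tenure
  early <- logged_df[logged_df$days_since_signup < cutoff_days, ]
  # Count distinct active days per user as a proxy for sessions
  sessions <- tapply(early$days_since_signup, early$child_id,
                     function(d) length(unique(d)))
  names(sessions)[sessions >= n_sessions]
}

# Toy check: user "a" has 6 active days in weeks 1-3, user "b" has 2
toy <- data.frame(
  child_id = c(rep("a", 6), rep("b", 2)),
  days_since_signup = c(0, 2, 5, 9, 12, 15, 1, 4)
)
engaged <- select_engaged_users(toy, n_sessions = 6, m_weeks = 3)
```

Rerunning the downstream models with, say, `select_engaged_users(df, n_sessions = 4, m_weeks = 2)` then lets you compare predictive performance across parameter choices.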
We begin by filtering out users in our dataframes who have signed up on the app on or after December 1st, 2020. Otherwise, we cannot observe whether they remained active on the app after 8 weeks as we do not have activity data on or after February 1st, 2021. As in previous tutorials, you might need to modify this filtering logic based on the question that you are trying to answer.
# Select users who created their account on or after Dec. 1, 2020
post_dec_users <- filtered_child_df %>%
dplyr::filter(created_at >= "2020-12-01") %>%
pull(child_id)
# Filter out users who created their account on or after Dec. 1, 2020
filtered_logged_df_may_dec <- filtered_logged_df %>%
dplyr::filter(!(child_id %in% post_dec_users))
# Get each user's maximum sessions and days since signup
user_max_sessions <- filtered_logged_df_may_dec %>%
select(child_id, sessions_since_signup, days_since_signup) %>%
group_by(child_id) %>%
dplyr::summarize(sessions_since_signup = max(sessions_since_signup), days_since_signup = max(days_since_signup))
Then, we filter out users who do not meet the app activity criteria we define: that is, users who have fewer than n sessions in the first m weeks (in our case, n=6; m=3).
# Keep users whose observed activity spans at least 3 weeks (21 days)
user_max_sessions_3wks = user_max_sessions %>%
dplyr::filter(days_since_signup >= 21) %>%
pull(child_id)
# Get users who exhibit the characteristics described in the second model approach (i.e. n=6;m=3)
users_frst_3wks_daily_stories <- filtered_logged_df_may_dec %>%
dplyr::filter(child_id %in% user_max_sessions_3wks) %>%
dplyr::filter(days_since_signup < 21) %>% # 3 weeks = 21 days
select(child_id, days_since_signup)%>%
group_by(child_id) %>%
dplyr::summarize(n_sessions = n_distinct(days_since_signup)) %>%
dplyr::filter(n_sessions >= 6) %>%
pull(child_id)
# Create dataframe for users who exhibit the characteristics described in the second model approach
filtered_logged_df_min_engagement_3wks = filtered_logged_df_may_dec %>%
dplyr::filter(child_id %in% users_frst_3wks_daily_stories)
As before, we transform the raw data into features that better represent the underlying problem before training our machine learning model. We use both fixed and initial behavioral characteristics to build the model features.
You are highly encouraged to brainstorm and engineer other features that could improve the model’s predictive performance. If you do so, add those features to the model and investigate how they affect the performance.
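As an illustration of the kind of feature you might engineer, the sketch below computes a hypothetical extra feature: the average gap (in days) between each user's active days during the first three weeks. The column names mirror the tutorial's dataframes, but this feature is an assumption for illustration, not one used in the models below.

```r
# Sketch of a hypothetical extra feature: the average gap (in days)
# between a user's active days in the first three weeks of tenure.
avg_session_gap_3wks <- function(logged_df) {
  early <- logged_df[logged_df$days_since_signup < 21, ]
  gaps <- tapply(early$days_since_signup, early$child_id, function(d) {
    d <- sort(unique(d))
    if (length(d) < 2) return(NA_real_)  # undefined with one active day
    mean(diff(d))
  })
  data.frame(child_id = names(gaps), avg_gap_3wks = as.numeric(gaps))
}

# Toy check: active on days 0, 3, 6, 12 -> gaps 3, 3, 6
toy <- data.frame(child_id = rep("a", 4),
                  days_since_signup = c(0, 3, 6, 12))
feat <- avg_session_gap_3wks(toy)
```

A feature like this could capture usage regularity, which is plausibly related to retention; joining it onto the features dataframe by `child_id` would let you test that hypothesis directly.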
The final step before building the predictive models is to preprocess the features that we will use to train our model on. The preprocessing step includes:
We create the binary dependent variable, retained_8wks, which we assign a value of 1 if a user has remained active on the app for at least the first 8 weeks and a value of 0 if a user has stopped using the app within the first 8 weeks.
Next, we split the dataset into a training and test set. We do this by drawing a random sample of 70% into the training set and 30% into the test set. The reason for this train-test split is that we want to train our model on the bulk of the data and then check how well our model performs on data the model has never seen. This way we make sure that our model does not overfit (i.e. is too closely modeled on the training dataset and does not generalize well to other datasets). Note that 80-20 is also a common train-test split method. However, because the number of users in our processed dataset is relatively small (~2600 users), we want to make sure that we include a sufficient number of users from both classes in our test set (retained and not retained after 8 weeks) to understand the performance of the model on both user classes.
After splitting the dataset, we standardize our features by centering (subtracting the mean) and scaling (dividing by the standard deviation). We compute the standardization parameters (means and standard deviations) on the training set and apply those same parameters to the continuous variables of both the training and test sets. The motivation for standardization is that many machine learning models (e.g. linear regression, logistic regression) are commonly fit with gradient descent as an optimization technique. Gradient descent updates the model parameters (i.e. coefficients) in steps that minimize a cost function, which in turn improves the model's predictive performance. Because the feature values X appear in the gradient formula, the scale of each feature affects its effective step size, so features with very different ranges converge at different rates. Scaling the dataset we feed into the model ensures that the coefficients are updated at comparable rates and that the algorithm moves smoothly toward its convergence point (i.e. the minimum). See https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html for more on gradient descent.
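To make the intuition concrete, here is a minimal, purely illustrative gradient-descent fit of a logistic regression in base R on synthetic data. (Note that R's glm() actually fits by iteratively reweighted least squares rather than gradient descent; the point here is only that a single learning rate is shared across all coefficients, which is why comparable feature scales matter.)

```r
# Illustrative gradient descent for logistic regression (not how glm() fits)
sigmoid <- function(z) 1 / (1 + exp(-z))

logistic_gd <- function(X, y, lr = 0.1, iters = 5000) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    p <- sigmoid(X %*% beta)
    grad <- t(X) %*% (p - y) / nrow(X)  # gradient of the log-loss
    beta <- beta - lr * grad            # same learning rate for every feature
  }
  beta
}

# Synthetic data with an intercept column and a standardized predictor
set.seed(42)
x <- scale(rnorm(200))
X <- cbind(1, x)
y <- as.numeric(x + rnorm(200) > 0)
beta_hat <- logistic_gd(X, y)
```

If you replace `scale(rnorm(200))` with a predictor on a much larger scale (say, multiplied by 1000), the same learning rate either crawls or diverges, which is the behavior standardization guards against.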
As a final step, we convert the categorical variables into dummy variables because, as discussed above, we do not want the model to interpret the categories of these variables as numerical values.
# Create the dependent variable 'retained_8wks' using the `max_sessions_since_signup` variable
# If max_sessions_since_signup is greater than 7, the user was active for at least the first 8 weeks (the count starts at 0 in the first week), so we assign retained_8wks a value of 1
features_df$retained_8wks <- ifelse(features_df$max_sessions_since_signup>7, 1, 0)
# Use train_fraction to split the dataset into training and test sets
train_fraction <- 0.70
n <- dim(features_df)[1]
# Set seed to ensure same training and test set used for model evaluation in different scripts
set.seed(5555)
# Randomly select the specified fraction of indices to be included in the training set
train_idx <- sample.int(n, replace=F, size=floor(n*train_fraction))
# Generate training sample using the 70-30 split rule
features_train <- as.data.frame(features_df[train_idx,])
features_test <- as.data.frame(features_df[-train_idx,])
# Select continuous variables to be scaled
cont <- features_df %>%
select(-c(child_id, grade, user, retained_8wks))
cont <- c(colnames(cont))
# Save unscaled features for model result interpretations later on
features_train_unscaled <- features_train
features_test_unscaled <- features_test
# Select continuous variables to be scaled
# Compute parameters for standardization on training set
pre_proc_val <- preProcess(features_train[,cont], method = c("center", "scale"))
# Standardize the continuous variables of the training and test sets by applying the 'pre_proc_val' parameters above
features_train[,cont] = predict(pre_proc_val, features_train[,cont])
features_test[,cont] = predict(pre_proc_val, features_test[,cont])
# Make dummy variables for variables of both training and test sets (scaled and unscaled versions)
cols_reg = c(colnames(features_df))
dummies <- dummyVars(retained_8wks ~ ., data = features_df[,cols_reg])
features_train_dummies = predict(dummies, newdata = features_train[,cols_reg])
features_test_dummies = predict(dummies, newdata = features_test[,cols_reg])
In this tutorial, we build a machine learning model to predict a user’s engagement after the first 8 weeks on the app. However, since x (first 8 weeks) was chosen somewhat arbitrarily, you are highly encouraged to modify this variable and investigate the model performance.
Exercise: You might consider building other types of models, using more sophisticated approaches to pick the input features for the models, and/or using techniques similar to those from your individual ML assignment to fine-tune the model parameters.
We start by building a logistic regression model in which the y variable is the binary variable retained_8wks, which equals 1 if the user has been active for at least 8 weeks and 0 otherwise. We then evaluate the logistic model performance by performing predictions on the test set. We use these predictions to compute the model performance metrics (i.e. accuracy, precision, and recall scores) and build a confusion matrix. These performance metrics as well as the confusion matrix allow us to compare the logistic model performance with the performance of other models we will be building in the next subsections.
Logistic regression is the go-to baseline machine learning model for binary classification. As in linear regression, the input values X are combined linearly using weights (coefficient values) to predict an output value y; however, in logistic regression the linear combination (also known as the score) is mapped to a value between 0 and 1 using the equation below (also known as the sigmoid function): \[y = 1 / (1 + e^{-score})\]
Thus, in binary classification problems, we use logistic regression to model the probability that the input values X belong to the default class (y=1, where y=retained_8wks). To assign a class to these input values, we need to transform the model probabilities into binary values (0 or 1): if the output value y is greater than 0.5, the user is more likely to remain active on the app after the first 8 weeks, so we assign the user the default class of 1.
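The score-to-class mapping just described can be written directly. This small sketch exposes the 0.5 cutoff as a parameter (an illustrative choice, in case you want to experiment with other thresholds):

```r
# Sketch: map a logistic score through the sigmoid and threshold it
# to produce a class label, as described above.
score_to_class <- function(score, threshold = 0.5) {
  p <- 1 / (1 + exp(-score))  # sigmoid: probability of the default class
  as.integer(p > threshold)
}
```

A score of 0 corresponds to a probability of exactly 0.5 and, with a strict inequality, is labeled 0; raising or lowering the threshold trades precision against recall.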
# Create the dependent variable 'retained_8wks'
y_train <- factor(features_train$retained_8wks)
y_test <- factor(features_test$retained_8wks)
# `max_sessions_since_signup` and `max_days_since_signup` are perfectly correlated with `retained_8wks` because we used them to construct that indicator, so we drop them from the predictors
X_train <- data.frame(features_train_dummies) %>% select(-c(max_sessions_since_signup,
max_days_since_signup,
child_id,
sessions_since_signup))
X_test <- data.frame(features_test_dummies) %>% select(-c(max_sessions_since_signup,
max_days_since_signup,
child_id,
sessions_since_signup))
# Create a training set dataframe as an input to the logistic model
train_lm<-as.data.frame(cbind(X_train, y_train))
colnames(train_lm)<-c(colnames(X_train),"retained_8wks")
# Create a test set dataframe to be used after building the logistic model for performance evaluation
test_lm<-as.data.frame(cbind(X_test, y_test))
colnames(test_lm)<-c(colnames(X_test),"retained_8wks")
# Create the model formula from a string, i.e. "retained_8wks ~ x1 + x2 + ..." using the column names of X_train
logit_reg<-as.formula(
paste("retained_8wks", paste(colnames(X_train), collapse = " + "), sep = " ~ "))
# Build a logistic model to predict whether a user will churn/exit after 8 weeks
logit_model <-glm(logit_reg, data = train_lm, family="binomial")
# Get the summary statistics for this logistic model
summary(logit_model)
##
## Call:
## glm(formula = logit_reg, family = "binomial", data = train_lm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1829 -1.1821 0.5558 1.0233 1.8385
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.977585 0.263612 7.502 6.29e-14 ***
## n_stories_firstwkly_session 0.425617 0.659130 0.646 0.51846
## avg_n_stories_firstwk -0.164136 0.264699 -0.620 0.53520
## n_sessions_firstwk 0.065296 0.116688 0.560 0.57577
## unique_stories_firstwk -0.556803 0.591167 -0.942 0.34626
## unique_sources_firstwk -0.005453 0.105770 -0.052 0.95888
## n_stories_first2wkly_sessions -0.201978 1.069173 -0.189 0.85016
## avg_n_stories_first2wkly_sessions -0.260178 0.355183 -0.733 0.46385
## avg_n_stories_first2wks 0.011382 0.433129 0.026 0.97904
## n_sessions_first2wks -0.151961 0.174594 -0.870 0.38410
## unique_stories_first2wks 0.636211 0.955271 0.666 0.50541
## unique_sources_first2wks 0.010749 0.146307 0.073 0.94143
## n_stories_first3wkly_sessions -0.941830 0.825192 -1.141 0.25373
## avg_n_stories_first3wkly_sessions 0.434709 0.340624 1.276 0.20188
## avg_n_stories_first3wks 0.139431 0.395762 0.352 0.72461
## n_sessions_first3wks 0.285332 0.149177 1.913 0.05579 .
## unique_stories_first3wks 0.270478 0.701156 0.386 0.69967
## unique_sources_first3wks 0.006350 0.120588 0.053 0.95800
## grade.Grade.1 -0.060884 0.226407 -0.269 0.78799
## grade.Grade.2 -0.123461 0.232029 -0.532 0.59466
## grade.Grade.3 -0.420457 0.236177 -1.780 0.07503 .
## grade.Grade.4 -0.525784 0.240512 -2.186 0.02881 *
## grade.Grade.5 -14.999918 311.693745 -0.048 0.96162
## grade.Grade.6 -1.861880 1.140374 -1.633 0.10253
## grade.Junior.KG -0.392144 0.292821 -1.339 0.18051
## grade.Nursery -0.105479 0.357341 -0.295 0.76786
## grade.Senior.KG NA NA NA NA
## user.b2b -0.018760 0.243186 -0.077 0.93851
## user.b2c -1.528378 0.182053 -8.395 < 2e-16 ***
## user.expired -1.261683 0.430831 -2.928 0.00341 **
## user.paid NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2312.6 on 1774 degrees of freedom
## Residual deviance: 2092.6 on 1746 degrees of freedom
## AIC: 2150.6
##
## Number of Fisher Scoring iterations: 13
Note that when a full set of dummy variables is included, R reports NA as the coefficient of the redundant (aliased) dummy when running a regression. The NA indicates that the variable in question (e.g. user.paid) is a linear combination of the other variables (here user.b2b, user.b2c, and user.expired, together with the intercept). The regression would not have a unique solution without dropping one of these variables, so R effectively drops it (in this case, user.paid). These NA values have no effect on the computation of the other coefficients.
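This singularity behavior is easy to reproduce in miniature. In the illustrative example below (synthetic data), `d2` is exactly `1 - d1`, so together with the intercept it is redundant and R reports NA for its coefficient:

```r
# Minimal reproduction of NA coefficients from perfect collinearity
d1 <- c(0, 0, 1, 1, 0, 1)
d2 <- 1 - d1  # perfectly collinear with d1 (plus the intercept)
y  <- c(1.2, 0.9, 2.1, 2.3, 1.0, 2.0)
fit <- lm(y ~ d1 + d2)
coefs <- coef(fit)
# coefs["d2"] is NA; the intercept and d1 coefficients are unaffected
```

The same mechanism produces the NA rows for `grade.Senior.KG` and `user.paid` in the glm summary above, because `dummyVars` creates one dummy per category rather than dropping a baseline.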
We now evaluate the performance of this logistic regression model using the accuracy, precision, and recall scores (among other scores) on the test set.
# Predict and evaluate on training set
logit_probs_train <- predict(logit_model, newdata = train_lm, type="response")
X_train$logit_probs <- logit_probs_train
# if predicted probability value > 0.5, user is more likely to remain active on the app after 8 weeks
logit_preds_train <- ifelse(logit_probs_train > 0.5, 1, 0)
X_train$logit_preds <- logit_preds_train
# Model accuracy and confusion matrix for training set
mean(logit_preds_train == y_train)
## [1] 0.6687324
confusionMatrix(data=factor(logit_preds_train), y_train)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 182 137
## 1 451 1005
##
## Accuracy : 0.6687
## 95% CI : (0.6463, 0.6906)
## No Information Rate : 0.6434
## P-Value [Acc > NIR] : 0.01339
##
## Kappa : 0.1884
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.2875
## Specificity : 0.8800
## Pos Pred Value : 0.5705
## Neg Pred Value : 0.6902
## Prevalence : 0.3566
## Detection Rate : 0.1025
## Detection Prevalence : 0.1797
## Balanced Accuracy : 0.5838
##
## 'Positive' Class : 0
##
# Predict and evaluate on test set
logit_probs_test <- predict(logit_model, newdata = test_lm, type="response")
X_test$logit_probs <- logit_probs_test
# if predicted probability value > 0.5, user is more likely to remain active on the app after 8 weeks
logit_preds_test <- ifelse(logit_probs_test > 0.5, 1, 0)
X_test$logit_preds <- logit_preds_test
# Model accuracy and confusion matrix for test set
mean(logit_preds_test == y_test)
## [1] 0.6220472
confusionMatrix(data=factor(logit_preds_test), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 62 72
## 1 216 412
##
## Accuracy : 0.622
## 95% CI : (0.5865, 0.6566)
## No Information Rate : 0.6352
## P-Value [Acc > NIR] : 0.7857
##
## Kappa : 0.0835
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.22302
## Specificity : 0.85124
## Pos Pred Value : 0.46269
## Neg Pred Value : 0.65605
## Prevalence : 0.36483
## Detection Rate : 0.08136
## Detection Prevalence : 0.17585
## Balanced Accuracy : 0.53713
##
## 'Positive' Class : 0
##
# Create a data frame with targets and predictions
d_binomial <- tibble("target" = y_test,
"prediction" = logit_preds_test)
# Use evaluate() function to evaluate the predictions and get the confusion matrix tibble
eval <- evaluate(d_binomial,
target_col = "target",
prediction_cols = "prediction",
type = "binomial")
# Plot the eval output that contains the confusion matrix tibble
plot_confusion_matrix(eval$`Confusion Matrix`[[1]],
add_normalized=FALSE)The target represents the true class of the observations; in our case, the target retained_8wks can be either 0 or 1. The prediction represents the predicted class by the model.
There are 4 different scenarios depicted by each tile of the confusion matrix:
In the middle of each tile, we have the count of observations that fall within that tile. In other words, Tile 1 tells us that the model correctly identified 412 users who remain active on the app for at least 8 weeks.
At the bottom of each tile, we have the column percentage. That is, out of all the observations where Target is 1, 85.1% of them were predicted to be 1 and 14.9% were predicted to be 0.
At the right side of each tile, we have the row percentage. That is, out of all the observations where Prediction is 1, 65.6% of them were actually 1 while 34.4% were actually 0.
Also, the color intensity visually represents the count values of each tile. The darker the color of a tile, the higher the count of observations in that tile.
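These row and column percentages can be reproduced from the confusion-matrix counts themselves with `prop.table()`. A short sketch, using the test-set counts printed above (rows are predictions, columns are targets):

```r
# Test-set confusion matrix counts from the logistic model above
cm <- matrix(c(62, 216, 72, 412), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Target = c("0", "1")))

round(prop.table(cm, margin = 2), 3)  # column %: share within each true class
round(prop.table(cm, margin = 1), 3)  # row %: share within each predicted class
```

`margin = 2` normalizes each column to sum to 1 (the column percentages shown at the bottom of the tiles), while `margin = 1` normalizes each row (the row percentages at the side of the tiles).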
We can observe that the model is better at identifying the users who will remain active on the app after the first 8 weeks (see the column percentage in Tile 1) than at identifying the users who will stop using the app within the first 8 weeks (see the column percentage in Tile 4). One possible explanation for this is that the dataset the model is trained on is unbalanced. As can be observed from the confusion matrix for the training set above, there are 1,142 users who remain on the app after the first 8 weeks and only 633 users who stop using the app within the first 8 weeks in the training set.
Furthermore, whenever you are working on a classification problem, it is highly important to determine which metric is important for the specific problem you are trying to solve. Accuracy will not always be the appropriate metric to use for selecting the best model. To determine the appropriate metric(s) to use for evaluating this logistic regression model, we first need to understand the differences between the metrics (in this case, accuracy, precision, and recall).
Accuracy is simply the number of correct predictions divided by the total number of predictions. Mathematically, \[Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP+TN+FP+FN}\] where TP stands for True Positive, TN stands for True Negative, FP stands for False Positive, FN stands for False Negative.
Precision is the number of actual positives the model captures out of all the predicted positives. Mathematically, \[Precision = \frac{\text{True positives}}{\text{Total predicted positives}} = \frac{TP}{TP+FP}\]
Recall is the number of actual positives the model captures through labeling them as Positive (True Positive). Mathematically, \[Recall = \frac{\text{True positives}}{\text{Total actual positives}} = \frac{TP}{TP+FN}\]
We select the best model by analyzing the accuracy, precision, and/or recall scores depending on which type of incorrect prediction has a higher cost associated with it in our classification problem. For example, if our classification problem is to detect whether a patient is sick or not (where the label of patient being sick is 1) then the cost associated with a False Negative is higher than the cost associated with a False Positive because not detecting that a patient is sick (when the patient is actually sick) is worse than classifying a healthy patient as sick. So, in this case, we would opt for using recall as a metric to select the best model.
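To make these formulas concrete, the three metrics can be computed directly from confusion-matrix counts. The counts below are hypothetical, not taken from the models above; class 1 is treated as the positive class.

```r
# Toy confusion-matrix counts (hypothetical values)
TP <- 40; TN <- 50; FP <- 10; FN <- 20

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # correct / total
precision <- TP / (TP + FP)                   # of predicted positives, how many are real
recall    <- TP / (TP + FN)                   # of actual positives, how many are caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy = 0.750, precision = 0.800, recall = 0.667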
We now build a regularized logistic regression model, which prevents over-fitting by penalizing the model coefficients, similar to the regularized linear regression models we built in the Intro to ML tutorial. Specifically, we build a lasso logistic regression, which penalizes the sum of the absolute values of the coefficients (ridge, by contrast, penalizes the sum of squared coefficients). However, you are free to build a different type of regularized logistic model, such as the ridge and/or elastic-net (a combination of ridge and lasso) logistic models.
We then perform cross-validation to find the optimal lambda (i.e. regularization parameter) and evaluate the performance of the lasso logistic model with this optimal lambda on the test set. As discussed in the Intro to ML tutorial, cross-validation is a technique that is used to evaluate the performance of machine learning models by making predictions on different subsets of the data and tuning the model parameters (i.e. by finding the model parameters that minimize the cross-validation error).
# Dummy code categorical predictor variables
X_cv_train <- model.matrix(retained_8wks~., train_lm)[,-1]
# Find the best lambda using cross-validation
cv_logit <- cv.glmnet(X_cv_train, y_train, alpha = 1, family = "binomial") # alpha = 1 gives lasso in glmnet (alpha = 0 gives ridge)
# Fit the final model on the training data
cv_lasso_logit_model <- glmnet(X_cv_train, y_train, alpha = 1, family = "binomial",
lambda = cv_logit$lambda.min)
coef(cv_lasso_logit_model)
## 31 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.9263578089
## n_stories_firstwkly_session -0.0567026283
## avg_n_stories_firstwk -0.0738407499
## n_sessions_firstwk 0.0214783262
## unique_stories_firstwk -0.0169717188
## unique_sources_firstwk -0.0104768144
## n_stories_first2wkly_sessions -0.0207675026
## avg_n_stories_first2wkly_sessions -0.0062087013
## avg_n_stories_first2wks -0.0066304412
## n_sessions_first2wks -0.0049011005
## unique_stories_first2wks 0.0403877987
## unique_sources_first2wks -0.0009237574
## n_stories_first3wkly_sessions -0.0369599132
## avg_n_stories_first3wkly_sessions 0.0157855865
## avg_n_stories_first3wks -0.0203042075
## n_sessions_first3wks 0.0852682359
## unique_stories_first3wks 0.0201288872
## unique_sources_first3wks 0.0217728294
## grade.Grade.1 0.1606046023
## grade.Grade.2 0.1134581766
## grade.Grade.3 -0.1271632849
## grade.Grade.4 -0.2290694602
## grade.Grade.5 -2.2871185363
## grade.Grade.6 -1.2940856985
## grade.Junior.KG -0.1008569303
## grade.Nursery 0.0851284051
## grade.Senior.KG 0.2200117236
## user.b2b 0.5942161514
## user.b2c -0.6931871750
## user.expired -0.3551138385
## user.paid 0.5979662945
# Make predictions on the test data
X_cv_test <- model.matrix(retained_8wks ~., test_lm)[,-1]
cv_logit_probs_test <- cv_lasso_logit_model %>% predict(newx = X_cv_test, type = "response") # type = "response" returns probabilities rather than log-odds
cv_logit_preds_test <- ifelse(cv_logit_probs_test > 0.5, "1", "0")
# Model accuracy and confusion matrix for test set
mean(cv_logit_preds_test == y_test)
## [1] 0.5839895
confusionMatrix(data=factor(cv_logit_preds_test), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 206 245
## 1 72 239
##
## Accuracy : 0.584
## 95% CI : (0.5481, 0.6193)
## No Information Rate : 0.6352
## P-Value [Acc > NIR] : 0.9984
##
## Kappa : 0.2073
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7410
## Specificity : 0.4938
## Pos Pred Value : 0.4568
## Neg Pred Value : 0.7685
## Prevalence : 0.3648
## Detection Rate : 0.2703
## Detection Prevalence : 0.5919
## Balanced Accuracy : 0.6174
##
## 'Positive' Class : 0
##
# Create a data frame with targets and predictions
d_binomial <- tibble("target" = y_test,
"prediction" = cv_logit_preds_test)
# Use evaluate() function to evaluate the predictions and get the confusion matrix tibble
eval <- evaluate(d_binomial,
target_col = "target",
prediction_cols = "prediction",
type = "binomial")
# Plot the eval output that contains the confusion matrix tibble
plot_confusion_matrix(eval$`Confusion Matrix`[[1]],
add_normalized=FALSE)By regularizing the model coefficients, we get an accuracy score of ~59.62%, which is ~3% lower than the accuracy of the unregularized logistic regression model.
Furthermore, we observe that the lasso logistic model is better than the unregularized logistic model at identifying the users who will stop using the app within their first 8 weeks (i.e. the target=0, prediction=0 column percentage increases from 22.3% to 74.1%). However, it is worse at identifying the users who will remain active on the app after the first 8 weeks (i.e. the target=1, prediction=1 column percentage decreases from 85.1% to 49.4%). It is up to you as the analyst to determine the main goal(s) of building a predictive model and to decide which metrics to optimize accordingly, which can involve trade-offs between different performance metrics.
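One way to explore this trade-off without refitting anything is to vary the 0.5 classification cutoff. A sketch, assuming the objects `logit_probs_test` and `y_test` from the earlier chunks are still in memory:

```r
# Sweep the classification threshold and report precision/recall for class 0
# (the users who stop using the app within the first 8 weeks)
for (cutoff in c(0.3, 0.5, 0.7)) {
  preds <- ifelse(logit_probs_test > cutoff, 1, 0)
  tp0 <- sum(preds == 0 & y_test == 0)       # correctly flagged churners
  cat(sprintf("cutoff=%.1f  precision(0)=%.3f  recall(0)=%.3f\n",
              cutoff, tp0 / sum(preds == 0), tp0 / sum(y_test == 0)))
}
```

Raising the cutoff flags more users as likely churners (class 0), which typically raises recall for class 0 at the cost of precision, and vice versa.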
Think about potential approaches you could take to improve the model performance. For example, you could balance the dataset (see some techniques outlined here) or redefine the question we introduced at the beginning of this section that motivated building these predictive models.
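For instance, one simple balancing technique is to downsample the majority class before refitting. A sketch, assuming the `train_lm` dataframe from above (where retained_8wks = 1 is the majority class):

```r
# Downsample the majority class so both classes have equal counts
set.seed(42)  # for reproducibility
minority      <- train_lm[train_lm$retained_8wks == 0, ]
majority      <- train_lm[train_lm$retained_8wks == 1, ]
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
train_balanced <- rbind(minority, majority_down)

table(train_balanced$retained_8wks)  # now 50/50
```

Downsampling discards information from the majority class; oversampling the minority class (or synthetic methods such as SMOTE) are alternatives with different trade-offs.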
Which metric would you use to select the best model in our classification problem? Why?
As we have seen above, any interpretation needs to take into account the joint distribution of covariates. One possible heuristic is to consider data-driven subgroups. For example, we can analyze what differentiates observations whose predicted probabilities are high from those whose predicted probabilities are low.
The following heatmap visualizes the results for the user quartiles created based on the predicted classes (either 0 or 1).
# Table of covariates means/sd by n.tile. The n.tile is the variable that
# defines the subgroups here, which needs to be done before running the function.
plot_covariate_means_by_ntile <- function(.df, .ntile = "ntile", covariate_names, n_top = 10) {
.df <- as.data.frame(.df)
covariate_names <- covariate_names
.df[, .ntile] <- as.factor(.df[, .ntile])
# Regress each covariate on ntile/subgroup assignment to estimate group means
cov_means <- lapply(covariate_names, function(covariate) {
lm_robust(as.formula(paste0(covariate, " ~ 0 + ", .ntile)), data = .df, se_type = "stata")
})
# Extract the mean and standard deviation of each covariate per ntile/subgroup
cov_table <- lapply(cov_means, function(cov_mean) {
means <- as.data.frame(t(coef(summary(cov_mean))[,c("Estimate", "Std. Error")]))
means
})
# Preparation to color the chart
temp_standardized <- sapply(seq_along(covariate_names), function(j) {
covariate_name <- covariate_names[j]
.mean <- mean(.df[, covariate_name], na.rm = TRUE)
.sd <- sd(.df[, covariate_name], na.rm = TRUE)
m <- as.matrix(round(signif(cov_table[[j]], digits=4), 3))
.standardized <- (m["Estimate",] - .mean) / .sd
.standardized
})
colnames(temp_standardized) <- covariate_names
ordering <- order(apply(temp_standardized, MARGIN = 2, function(x) {.range <- range(x); abs(.range[2] - .range[1])}), decreasing = TRUE)
# fwrite(tibble::rownames_to_column(as.data.frame(t(temp_standardized)[ordering,])),
# paste0(directory$data, "/covariate_standardized_means_by_", .ntile, ".csv"))
color_scale <- max(abs(c(max(temp_standardized, na.rm = TRUE), min(temp_standardized, na.rm = TRUE))))
color_scale <- color_scale * c(-1,1)
max_std_dev <- floor(max(color_scale))
breaks <- -max_std_dev:max_std_dev
labels <- c(" ", breaks, " ")
breaks <- c(min(color_scale), breaks, max(color_scale))
# Little trick to display the standard errors
table <- lapply(seq_along(covariate_names), function(j) {
covariate_name <- covariate_names[j]
.mean <- mean(.df[, covariate_name], na.rm = TRUE)
.sd <- sd(.df[, covariate_name], na.rm = TRUE)
m <- as.matrix(round(signif(cov_table[[j]], digits=4), 3))
.standardized <- (m["Estimate",] - .mean) / .sd
return(data.frame(covariate = covariate_name,
group = 1:ncol(m),
estimate = m["Estimate",], std.error = m["Std. Error",],
standardized = .standardized))
})
# table <- do.call(rbind, table)
table <- rbindlist(table)
setnames(table, "group", .ntile)
table[, covariate := factor(covariate, levels = rev(covariate_names[ordering]), ordered = TRUE)]
table[covariate %in% head(covariate_names[ordering], n_top)] %>%
mutate(info = paste0(estimate, "\n(", std.error, ")")) %>%
ggplot(aes_string(x = .ntile, y = "covariate")) +
# Add coloring
geom_raster(aes(fill = standardized)
, alpha = 0.9
) +
scale_fill_distiller(palette = "RdBu",
direction = 1,
breaks = breaks,
labels = labels,
limits = color_scale,
name = "Standard\nDeviation on\nNormalized\nDistribution"
) +
# add numerics
geom_text(aes(label = info), size=2.1) +
# reformat
labs(title = paste0("Average covariate values within group based on binary prediction ranking (cutoff of 0.5)"),
y = "within covariate") +
scale_x_continuous(position = "top") +
scale_x_discrete(position = "top", limits=c("0","1")) +
theme(plot.title = element_text(size = 8, face = "bold"))
#cowplot::theme_minimal_hgrid(16)
}
# Now use the function to get average covariates for user subgroups by binary prediction ranking
X_train <- X_train %>% mutate(quantile_probs = ntile(X_train$logit_probs, 4))
train_data<-as.data.frame(cbind(X_train, features_train$retained_8wks))
colnames(train_data)<-c(colnames(X_train),"retained_8wks")
X_test <- X_test %>% mutate(quantile_probs = ntile(X_test$logit_probs, 4))
test_data<-as.data.frame(cbind(X_test, features_test$retained_8wks))
colnames(test_data)<-c(colnames(X_test),"retained_8wks")
data = rbind(train_data, test_data) %>%
select(-c(quantile_probs,
logit_probs))
covar_names = colnames(data)[colnames(data) != "logit_preds"]
plot_covariate_means_by_ntile(data,
.ntile = "logit_preds",
covar_names,
n_top = 20 # <- this can be changed to any number
)
The following heatmap visualizes the results for the user quartiles created based on the predicted class probabilities.
# Table of covariates means/sd by n.tile. The n.tile is the variable that
# defines the subgroups here, which needs to be done before running the function.
plot_covariate_means_by_ntile <- function(.df, .ntile = "ntile", covariate_names, n_top = 10) {
.df <- as.data.frame(.df)
covariate_names <- covariate_names
.df[, .ntile] <- as.factor(.df[, .ntile])
# Regress each covariate on ntile/subgroup assignment to estimate group means
cov_means <- lapply(covariate_names, function(covariate) {
lm_robust(as.formula(paste0(covariate, " ~ 0 + ", .ntile)), data = .df, se_type = "stata")
})
# Extract the mean and standard deviation of each covariate per ntile/subgroup
cov_table <- lapply(cov_means, function(cov_mean) {
means <- as.data.frame(t(coef(summary(cov_mean))[,c("Estimate", "Std. Error")]))
means
})
# Preparation to color the chart
temp_standardized <- sapply(seq_along(covariate_names), function(j) {
covariate_name <- covariate_names[j]
.mean <- mean(.df[, covariate_name], na.rm = TRUE)
.sd <- sd(.df[, covariate_name], na.rm = TRUE)
m <- as.matrix(round(signif(cov_table[[j]], digits=4), 3))
.standardized <- (m["Estimate",] - .mean) / .sd
.standardized
})
colnames(temp_standardized) <- covariate_names
ordering <- order(apply(temp_standardized, MARGIN = 2, function(x) {.range <- range(x); abs(.range[2] - .range[1])}), decreasing = TRUE)
# fwrite(tibble::rownames_to_column(as.data.frame(t(temp_standardized)[ordering,])),
# paste0(directory$data, "/covariate_standardized_means_by_", .ntile, ".csv"))
color_scale <- max(abs(c(max(temp_standardized, na.rm = TRUE), min(temp_standardized, na.rm = TRUE))))
color_scale <- color_scale * c(-1,1)
max_std_dev <- floor(max(color_scale))
breaks <- -max_std_dev:max_std_dev
labels <- c(" ", breaks, " ")
breaks <- c(min(color_scale), breaks, max(color_scale))
# Little trick to display the standard errors
table <- lapply(seq_along(covariate_names), function(j) {
covariate_name <- covariate_names[j]
.mean <- mean(.df[, covariate_name], na.rm = TRUE)
.sd <- sd(.df[, covariate_name], na.rm = TRUE)
m <- as.matrix(round(signif(cov_table[[j]], digits=4), 3))
.standardized <- (m["Estimate",] - .mean) / .sd
return(data.frame(covariate = covariate_name,
group = 1:ncol(m),
estimate = m["Estimate",], std.error = m["Std. Error",],
standardized = .standardized))
})
# table <- do.call(rbind, table)
table <- rbindlist(table)
setnames(table, "group", .ntile)
table[, covariate := factor(covariate, levels = rev(covariate_names[ordering]), ordered = TRUE)]
table[covariate %in% head(covariate_names[ordering], n_top)] %>%
mutate(info = paste0(estimate, "\n(", std.error, ")")) %>%
ggplot(aes_string(x = .ntile, y = "covariate")) +
# Add coloring
geom_raster(aes(fill = standardized)
, alpha = 0.9
) +
scale_fill_distiller(palette = "RdBu",
direction = 1,
breaks = breaks,
labels = labels,
limits = color_scale,
name = "Standard\nDeviation on\nNormalized\nDistribution"
) +
# add numerics
geom_text(aes(label = info), size=2.1) +
# reformat
labs(title = paste0("Average covariate values within group based on quartiles of predicted probabilities"),
y = "within covariate") +
scale_x_continuous(position = "top") +
# scale_x_discrete(position = "top", limits=c("0","1")) +
theme(plot.title = element_text(size = 8, face = "bold"))
#cowplot::theme_minimal_hgrid(16)
}
# Now use the function to get average covariates for user subgroups by quartiles of predicted probabilities
train_data<-as.data.frame(cbind(X_train, features_train$retained_8wks))
colnames(train_data)<-c(colnames(X_train),"retained_8wks")
test_data<-as.data.frame(cbind(X_test, features_test$retained_8wks))
colnames(test_data)<-c(colnames(X_test),"retained_8wks")
data = rbind(train_data, test_data) %>%
select(-c(logit_preds,
logit_probs))
covar_names = colnames(data)[colnames(data) != "quantile_probs"]
plot_covariate_means_by_ntile(data,
.ntile = "quantile_probs",
covar_names,
n_top = 20 # <- this can be changed to any number
)
An interesting observation from this heatmap is that half of the users in Quartile 4, the subgroup with the highest predicted probabilities, are B2B users.
Observing the heatmaps above, what other interesting insights can you obtain for user subgroups based on quartiles of predicted probabilities and/or based on binary predicted rankings?
Let’s look at the average true retention rate of users in the test set by dividing them into subgroups based on quartiles of predicted probabilities. This allows us to understand how well the predicted probabilities match the actual probabilities of remaining on the app. In our case, the actual probability of remaining on the app can be either 0 (if the user stops using the app within the first 8 weeks) or 1 (if the user remains on the app for at least the first 8 weeks).
Note that we use the predicted probabilities obtained from the unregularized logistic model for this calibration analysis. You are encouraged to perform similar calibration analysis for other models.
# Combine the X_test and y_test
test_x_y <-as.data.frame(cbind(X_test, features_test$retained_8wks))
colnames(test_x_y)<-c(colnames(X_test),"retained_8wks")
# Get count, mean, standard deviation, standard error of the mean, and 95% confidence interval (default 95%)
sum_retained <- summarySE(test_x_y, measurevar="retained_8wks", groupvars=c("quantile_probs"))
# Plot the average retention rate for each user quartile
ggplot(sum_retained, aes(x = quantile_probs, y = retained_8wks, fill=quantile_probs)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin = retained_8wks - ci, ymax = retained_8wks + ci),
width=.2, position=position_dodge(.9)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Quartiles based on predicted probabilities") +
ylab("Average true retention rate") +
scale_fill_distiller(name="Quartiles based on predicted probabilities")
Each bar in the graph shows the average true retention rate (calculated using the dependent variable retained_8wks) by user subgroup. We generate the user subgroups by dividing users into quartiles based on their predicted probabilities of remaining on the app after the first 8 weeks. For example, the users in Quartile 1 have the lowest predicted probabilities, whereas the users in Quartile 4 have the highest. The graph shows that the predicted probabilities of remaining on the app are better calibrated for users in Quartiles 3 and 4 than for users in Quartiles 1 and 2. Model predictions for users in Quartiles 3 and 4 are better calibrated because the average actual retention rate for users in these quartiles matches the predicted retention probabilities, implying that a larger fraction of the users in these quartiles actually remains on the app, just as the model predicts. For users in Quartiles 1 and 2, the model predictions are not as well-calibrated because the actual retention rates do not match the predicted retention probabilities as closely.
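The same calibration comparison can be read off numerically by placing the average predicted probability next to the average true outcome per quartile. A sketch, assuming the `test_x_y` dataframe built in the chunk above (which carries the `logit_probs` and `quantile_probs` columns):

```r
# Well-calibrated quartiles have mean_predicted close to mean_actual
test_x_y %>%
  group_by(quantile_probs) %>%
  summarise(mean_predicted = mean(logit_probs),
            mean_actual    = mean(retained_8wks))
```

A large gap between the two columns in a quartile indicates poor calibration for that subgroup of users.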
Observe the heatmaps above and see if there is any insight you can get from the covariates of users in Quartiles 1 and 2 that could explain why the model predictions are not well-calibrated for the users in these quartiles.
We then build a random forest model, which is another type of machine learning model used for binary classification problems.
Random Forest (RF) is a supervised learning algorithm based on the ensemble learning method: it combines many Decision Trees so that the final model predictions are more accurate than those of any individual base model. In RF models, the trees can be built in parallel because they do not interact with each other during training. To build these separate Decision Tree models, RF draws a bootstrap sample of the training data for every tree, which helps make the trees (and thus their predictions) less correlated. In addition, the RF model considers only a random subset of the features when splitting each node of every Decision Tree. Hence, no single tree sees the entire training data, which allows RF models to capture the general patterns of the data better and reduces their sensitivity to noise. Together, this data randomization and feature randomization for each Decision Tree are what allow RF models to reduce overfitting.
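The two sources of randomness described above map directly onto arguments of the randomForest function: `ntree` sets how many bootstrapped trees are grown, and `mtry` sets how many randomly chosen features are considered at each split. A sketch on hypothetical toy data (not the S2M features):

```r
library(randomForest)

# Hypothetical toy classification data
set.seed(7)
toy <- data.frame(y  = factor(rbinom(200, 1, 0.5)),
                  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))

# ntree: number of bootstrap samples/trees grown
# mtry: features tried per split (classification default is floor(sqrt(p)))
rf_toy <- randomForest(y ~ ., data = toy, ntree = 500, mtry = 1)
rf_toy
```

Lowering `mtry` makes the trees less correlated (more randomization) at the cost of weaker individual trees; these are natural parameters to tune via cross-validation.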
# Create the dependent variable 'retained_8wks'
y_train <- factor(features_train$retained_8wks)
y_test <- factor(features_test$retained_8wks)
# `max_sessions_since_signup` and `max_days_since_signup` are excluded because they were used to construct the `retained_8wks` indicator, so they are perfectly predictive of it
X_train <- data.frame(features_train_dummies) %>% select(-c(max_sessions_since_signup,
max_days_since_signup,
child_id,
sessions_since_signup))
X_test <- data.frame(features_test_dummies) %>% select(-c(max_sessions_since_signup,
max_days_since_signup,
child_id,
sessions_since_signup))
# Create a training set dataframe as an input to the random forest model
train_rf<-as.data.frame(cbind(X_train, y_train))
colnames(train_rf)<-c(colnames(X_train),"retained_8wks")
# Create a test set dataframe to be used after building the random forest model for performance evaluation
test_rf<-as.data.frame(cbind(X_test, y_test))
colnames(test_rf)<-c(colnames(X_test),"retained_8wks")
# Create a formula from a string; i.e. "y~ x1 + x2 + x3" where y=y_train and X=X_train
sumx <- paste(colnames(X_train), collapse = " + ")
form <- paste("y_train",sumx, sep=" ~ ")
form <- as.formula(form)
# Build a random forest model to predict whether a user will remain active on the app after the first 8 weeks
rf_model <- randomForest(form, data = train_rf)
We now evaluate the performance of this random forest model using the accuracy, precision, and recall scores (among other scores) on the test set.
# Predict and evaluate on training set
# (despite the object name, predict() returns predicted classes here, not probabilities)
rf_probs_train <- predict(rf_model, train_rf)
rf_accuracy_train <- table(rf_probs_train, y_train)
sum(diag(rf_accuracy_train))/sum(rf_accuracy_train)
## [1] 1
# Predict and evaluate on test set
rf_probs_test <- predict(rf_model, test_rf, type="response")
rf_accuracy_test <- table(rf_probs_test, y_test)
sum(diag(rf_accuracy_test))/sum(rf_accuracy_test)
## [1] 0.6325459
confusionMatrix(data=factor(rf_probs_test), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 88
## 1 192 396
##
## Accuracy : 0.6325
## 95% CI : (0.5972, 0.6669)
## No Information Rate : 0.6352
## P-Value [Acc > NIR] : 0.5759
##
## Kappa : 0.1386
##
## Mcnemar's Test P-Value : 7.488e-10
##
## Sensitivity : 0.3094
## Specificity : 0.8182
## Pos Pred Value : 0.4943
## Neg Pred Value : 0.6735
## Prevalence : 0.3648
## Detection Rate : 0.1129
## Detection Prevalence : 0.2283
## Balanced Accuracy : 0.5638
##
## 'Positive' Class : 0
##
# Predictions passed to the evaluate() function have to be numeric values
rf_probs_test_ints <- as.integer(as.character(rf_probs_test))
# create a data frame with targets and predictions
d_binomial <- tibble("target" = y_test,
"prediction" = rf_probs_test_ints)
# use evaluate() function to evaluate the predictions and get the confusion matrix tibble
eval <- evaluate(d_binomial,
target_col = "target",
prediction_cols = "prediction",
type = "binomial")
# plot the eval output that contains the confusion matrix tibble
plot_confusion_matrix(eval$`Confusion Matrix`[[1]],
add_normalized=FALSE)As can be observed from the accuracy scores of the random forest model on the training set and the test set, 100% vs. 63.25%, the model performs significantly better on the training set than any of the other models. However, compared to its performance on the training set, the model performs significantly worse on its test set. In addition, the accuracy score on the test set is comparable to the accuracy scores of the other models previously built. This clearly indicates that the RF model overfits on the training set, but does not generalize well to data points outside of the training set (in this case, the test set).
Using the confusion matrix, we can see that out of all the observations where Target is 1, 81.8% of them were predicted to be 1 and 18.2% were predicted to be 0. On the other hand, out of all the observations where Prediction is 1, 67.3% of them were actually 1 while 32.7% were actually 0.
Similar to the pattern observed in the unregularized logistic regression model, the random forest model is better at identifying the users who will remain active on the app after the first 8 weeks (see the column percentage in Tile 1) than at identifying the users who will stop using the app within the first 8 weeks (see the column percentage in Tile 4). One possible explanation, again, is that the training set is unbalanced: it contains 1,142 users who remain on the app after the first 8 weeks and only 633 users who stop using the app within the first 8 weeks.
In addition, although the test-set accuracy scores of the logistic and RF models are very similar (62.20% vs. 63.25%, respectively), on a relative basis the RF model is better at identifying the users who actually stop using the app within the first 8 weeks, while the logistic model is better at identifying the users who actually remain active on the app after the first 8 weeks.
Use the confusion matrices of both models to derive other interpretations of the results and try to come up with potential explanations for your observations.
# Exercise 10
Discuss with your team and decide which metric you will use to select the best model and why.
Now that we have guided you through the steps of building a baseline model using two different ML algorithms to predict whether a user will remain engaged on the app after the first 8 weeks, you can try building your own model using a different ML algorithm, different input variables, and/or a different dependent variable, or you can try to improve upon the models we have provided.